Deep Reinforcement Learning with Double Q-Learning


Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Hado van Hasselt, Arthur Guez, and David Silver
Google DeepMind

Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved.

Abstract

The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

The goal of reinforcement learning (Sutton and Barto 1998) is to learn good policies for sequential decision problems, by optimizing a cumulative future reward signal. Q-learning (Watkins 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.

In previous work, overestimations have been attributed to insufficiently flexible function approximation (Thrun and Schwartz 1993) and noise (van Hasselt 2010, 2011). In this paper, we unify these views and show overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error. Of course, imprecise value estimates are the norm during learning, which indicates that overestimations may be much more common than previously appreciated.

It is an open question whether, if the overestimations do occur, this negatively affects performance in practice. Overoptimistic value estimates are not necessarily a problem in and of themselves. If all values would be uniformly higher then the relative action preferences are preserved and we would not expect the resulting policy to be any worse. Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique (Kaelbling et al. 1996). If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. Thrun and Schwartz (1993) give specific examples in which this leads to suboptimal policies, even asymptotically.

To test whether overestimations occur in practice and at scale, we investigate the performance of the recent DQN algorithm (Mnih et al. 2015). DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games. In some ways, this setting is a best-case scenario for Q-learning, because the deep neural network provides flexible function approximation with the potential for a low asymptotic approximation error, and the determinism of the environments prevents the harmful effects of noise. Perhaps surprisingly, we show that even in this comparatively favorable setting DQN sometimes substantially overestimates the values of the actions.

We show that the Double Q-learning algorithm (van Hasselt 2010), which was first proposed in a tabular setting, can be generalized to arbitrary function approximation, including deep neural networks. We use this to construct a new algorithm called Double DQN. This algorithm not only yields more accurate value estimates, but leads to much higher scores on several games. This demonstrates that the overestimations of DQN indeed lead to poorer policies and that it is beneficial to reduce them. In addition, by improving upon DQN we obtain state-of-the-art results on the Atari domain.

Background

To solve sequential decision problems we can learn estimates for the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. Under a given policy $\pi$, the true value of an action $a$ in a state $s$ is

$$Q_\pi(s, a) \equiv \mathbb{E}\left[ R_1 + \gamma R_2 + \ldots \mid S_0 = s, A_0 = a, \pi \right],$$

where $\gamma \in [0, 1]$ is a discount factor that trades off the importance of immediate and later rewards. The optimal value is then $Q_*(s, a) = \max_\pi Q_\pi(s, a)$. An optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state.

Estimates for the optimal action values can be learned using Q-learning (Watkins 1989), a form of temporal difference learning (Sutton 1988). Most interesting problems are too large to learn all action values in all states separately. Instead, we can learn a parameterized value function $Q(s, a; \theta_t)$. The standard Q-learning update for the parameters after taking action $A_t$ in state $S_t$ and observing the immediate reward $R_{t+1}$ and resulting state $S_{t+1}$ is then

$$\theta_{t+1} = \theta_t + \alpha \left( Y_t^{Q} - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t), \qquad (1)$$

where $\alpha$ is a scalar step size and the target $Y_t^{Q}$ is defined as

$$Y_t^{Q} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t). \qquad (2)$$

This update resembles stochastic gradient descent, updating the current value $Q(S_t, A_t; \theta_t)$ towards a target value $Y_t^{Q}$.

Deep Q Networks

A deep Q network (DQN) is a multi-layered neural network that for a given state $s$ outputs a vector of action values $Q(s, \cdot\,; \theta)$, where $\theta$ are the parameters of the network. For an $n$-dimensional state space and an action space containing $m$ actions, the neural network is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Two important ingredients of the DQN algorithm as proposed by Mnih et al. (2015) are the use of a target network, and the use of experience replay. The target network, with parameters $\theta^-$, is the same as the online network except that its parameters are copied every $\tau$ steps from the online network, so that then $\theta_t^- = \theta_t$, and kept fixed on all other steps. The target used by DQN is then

$$Y_t^{DQN} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-). \qquad (3)$$

For the experience replay (Lin 1992), observed transitions are stored for some time and sampled uniformly from this memory bank to update the network. Both the target network and the experience replay dramatically improve the performance of the algorithm (Mnih et al. 2015).

Double Q-learning

The max operator in standard Q-learning and DQN, in (2) and (3), uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation.

In Double Q-learning (van Hasselt 2010), two value functions are learned by assigning experiences randomly to update one of the two value functions, resulting in two sets of weights, $\theta$ and $\theta'$. For each update, one set of weights is used to determine the greedy policy and the other to determine its value. For a clear comparison, we can untangle the selection and evaluation in Q-learning and rewrite its target (2) as

$$Y_t^{Q} = R_{t+1} + \gamma\, Q(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t).$$

The Double Q-learning target can then be written as

$$Y_t^{DoubleQ} \equiv R_{t+1} + \gamma\, Q(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta'_t). \qquad (4)$$

Notice that the selection of the action, in the argmax, is still due to the online weights $\theta_t$. This means that, as in Q-learning, we are still estimating the value of the greedy policy according to the current values, as defined by $\theta_t$. However, we use the second set of weights $\theta'_t$ to fairly evaluate the value of this policy. This second set of weights can be updated symmetrically by switching the roles of $\theta$ and $\theta'$.

Overoptimism due to estimation errors

Q-learning's overestimations were first investigated by Thrun and Schwartz (1993), who showed that if the action values contain random errors uniformly distributed in an interval $[-\epsilon, \epsilon]$ then each target is overestimated up to $\gamma \epsilon \frac{m-1}{m+1}$, where $m$ is the number of actions. In addition, Thrun and Schwartz give a concrete example in which these overestimations even asymptotically lead to sub-optimal policies, and show the overestimations manifest themselves in a small toy problem when using function approximation. Van Hasselt (2010) noted that noise in the environment can lead to overestimations even when using tabular representation, and proposed Double Q-learning as a solution.

In this section we demonstrate more generally that estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source. This is important, because in practice any method will incur some inaccuracies during learning, simply due to the fact that the true values are initially unknown.

The result by Thrun and Schwartz (1993) cited above gives an upper bound to the overestimation for a specific setup, but it is also possible, and potentially more interesting, to derive a lower bound.

Theorem 1. Consider a state $s$ in which all the true optimal action values are equal at $Q_*(s, a) = V_*(s)$ for some $V_*(s)$. Let $Q_t$ be arbitrary value estimates that are on the whole unbiased in the sense that $\sum_a (Q_t(s, a) - V_*(s)) = 0$, but that are not all correct, such that $\frac{1}{m} \sum_a (Q_t(s, a) - V_*(s))^2 = C$ for some $C > 0$, where $m \ge 2$ is the number of actions in $s$. Under these conditions, $\max_a Q_t(s, a) \ge V_*(s) + \sqrt{\frac{C}{m-1}}$. This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero. (Proof in appendix.)

Note that we did not need to assume that estimation errors for different actions are independent. This theorem shows that even if the value estimates are on average correct, estimation errors of any source can drive the estimates up and away from the true optimal values.

The lower bound in Theorem 1 decreases with the number of actions. This is an artifact of considering the lower bound, which requires very specific values to be attained. More typically, the overoptimism increases with the number of actions as shown in Figure 1. Q-learning's overestimations there indeed increase with the number of actions, while Double Q-learning is unbiased. As another example, if for all actions $Q_*(s, a) = V_*(s)$ and the estimation errors $Q_t(s, a) - V_*(s)$ are uniformly random in $[-1, 1]$, then the overoptimism is $\frac{m-1}{m+1}$. (Proof in appendix.)
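As a rough numerical check of the two results above, the following sketch builds one set of errors that satisfies the conditions of the theorem and attains the bound sqrt(C / (m - 1)) with equality, and then estimates by Monte Carlo the bias of the single estimator and of a double estimator for errors that are uniform in [-1, 1]. It is not taken from the paper: the function names `tight_errors` and `simulate_bias`, the trial count, the seed, and the choice V_*(s) = 0 are all illustrative assumptions.

```python
import numpy as np


def tight_errors(m, C):
    """Return m errors that sum to zero, have mean square C, and whose
    maximum equals sqrt(C / (m - 1)). Illustrative construction only."""
    eps = np.full(m, np.sqrt(C / (m - 1)))  # m - 1 small positive errors
    eps[-1] = -np.sqrt((m - 1) * C)         # one large negative error
    return eps


def simulate_bias(m, n_trials=100_000, seed=0):
    """Monte Carlo bias of the single and double estimators when the true
    value is 0 and the errors are i.i.d. uniform in [-1, 1] (assumption)."""
    rng = np.random.default_rng(seed)
    q_a = rng.uniform(-1.0, 1.0, size=(n_trials, m))  # first set of estimates
    q_b = rng.uniform(-1.0, 1.0, size=(n_trials, m))  # independent second set
    single = q_a.max(axis=1).mean()                   # max_a Q(s, a)
    greedy = q_a.argmax(axis=1)                       # action selected with Q
    double = q_b[np.arange(n_trials), greedy].mean()  # evaluated with Q'
    return single, double


if __name__ == "__main__":
    for m in (2, 5, 10):
        single, double = simulate_bias(m)
        print(m, round(single, 3), round((m - 1) / (m + 1), 3), round(double, 3))
```

Under these assumptions the single estimator should land close to (m - 1) / (m + 1), while the double estimator should stay near zero because the action is selected with one set of estimates and evaluated with an independent one.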

[Figure 1 plot. Axes: error versus number of actions. Series: $\max_a Q(s, a) - V_*(s)$ and $Q'(s, \arg\max_a Q(s, a)) - V_*(s)$.]

Figure 1: The orange bars show the bias in a single Q-learning update when the action values are $Q(s, a) = V_*(s) + \epsilon_a$ and the errors $\{\epsilon_a\}_{a=1}^{m}$ are independent standard normal random variables. The second set of action values $Q'$, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.

We now turn to function approximation and consider a real-valued continuous state space with 10 discrete actions in each state. For simplicity, the true optimal action values in this example depend only on state so that in each state all actions have the same true value. These true values are shown in the left column of plots in Figure 2 (purple lines) and are defined as either $Q_*(s, a) = \sin(s)$ (top row) or $Q_*(s, a) = 2 \exp(-s^2)$ (middle and bottom rows). The left plots also show an approximation for a single action (green lines) as a function of state as well as the samples the estimate is based on (green dots). The estimate is a $d$-degree polynomial that is fit to the true values at sampled states, where $d = 6$ (top and middle rows) or $d = 9$ (bottom row). The samples match the true function exactly: there is no noise and we assume we have ground truth for the action value on these sampled states. The approximation is inexact even on the sampled states for the top two rows because the function approximation is insufficiently flexible. In the bottom row, the function is flexible enough to fit the green dots, but this reduces the accuracy in unsampled states. Notice that the sampled states are spaced further apart near the left side of the left plots, resulting in larger estimation errors. In many ways this is a typical learning setting, where at each point in time we only have limited data.

The middle column of plots in Figure 2 shows estimated action values for all 10 actions (green lines), as functions of state, along with the maximum action value in each state (black dashed line). Although the true value function is the same for all actions, the approximations differ because they are based on different sets of sampled states [1]. The maximum is often higher than the ground truth shown in purple on the left. This is confirmed in the right plots, which show the difference between the black and purple curves in orange. The orange line is almost always positive, indicating an upward bias. The right plots also show the estimates from Double Q-learning in blue [2], which are on average much closer to zero. This demonstrates that Double Q-learning indeed can successfully reduce the overoptimism of Q-learning.

[1] Each action-value function is fit with a different subset of integer states. States $-6$ and $6$ are always included to avoid extrapolations, and for each action two adjacent integers are missing: for action $a_1$ states $-5$ and $-4$ are not sampled, for $a_2$ states $-4$ and $-3$ are not sampled, and so on. This causes the estimated values to differ.

The different rows in Figure 2 show variations of the same experiment. The difference between the top and middle rows is the true value function, demonstrating that overestimations are not an artifact of a specific true value function. The difference between the middle and bottom rows is the flexibility of the function approximation. In the left-middle plot, the estimates are even incorrect for some of the sampled states because the function is insufficiently flexible. The function in the bottom-left plot is more flexible but this causes higher estimation errors for unseen states, resulting in higher overestimations. This is important because flexible parametric function approximation is often employed in reinforcement learning (see, e.g., Tesauro 1995, Sallans and Hinton 2004, Riedmiller 2005, and Mnih et al. 2015).

In contrast to van Hasselt (2010), we did not use a statistical argument to find overestimations; the process to obtain Figure 2 is fully deterministic. In contrast to Thrun and Schwartz (1993), we did not rely on inflexible function approximation with irreducible asymptotic errors; the bottom row shows that a function that is flexible enough to cover all samples leads to high overestimations. This indicates that the overestimations can occur quite generally.

In the examples above, overestimations occur even when assuming we have samples of the true action value at certain states. The value estimates can further deteriorate if we bootstrap off of action values that are already overoptimistic, since this causes overestimations to propagate throughout our estimates. Although uniformly overestimating values might not hurt the resulting policy, in practice overestimation errors will differ for different states and actions.
Overestimation combined with bootstrapping then has the pernicious effect of propagating the wrong relative information about which states are more valuable than others, directly affecting the quality of the learned policies.

The overestimations should not be confused with optimism in the face of uncertainty (Sutton 1990, Agrawal 1995, Kaelbling et al. 1996, Auer et al. 2002, Brafman and Tennenholtz 2003, Szita and Lőrincz 2008, Strehl and Littman 2009), where an exploration bonus is given to states or actions with uncertain values. The overestimations discussed here occur only after updating, resulting in overoptimism in the face of apparent certainty. Thrun and Schwartz (1993) noted that, in contrast to optimism in the face of uncertainty, these overestimations actually can impede learning an optimal policy. We confirm this negative effect on policy quality in our experiments: when we reduce the overestimations using Double Q-learning, the policies improve.

[2] We arbitrarily used the samples of action $a_{i+5}$ (for $i \le 5$) or $a_{i-5}$ (for $i > 5$) as the second set of samples for the double estimator of action $a_i$.

[Figure 2 plots. Columns: "True value and an estimate", "All estimates and max", "Bias as function of state". Horizontal axis: state. The right column also reports the average error of the max estimate and of the Double-Q estimate.]

Figure 2: Illustration of overestimations during learning. In each state (x-axis), there are 10 actions. The left column shows the true values $V_*(s)$ (purple line). All true action values are defined by $Q_*(s, a) = V_*(s)$. The green line shows estimated values $Q(s, a)$ for one action as a function of state, fitted to the true value at several sampled states (green dots). The middle column plots show all the estimated values (green), and the maximum of these values (dashed black). The maximum is higher than the true value (purple, left plot) almost everywhere. The right column plots show the difference in orange. The blue line in the right plots is the estimate used by Double Q-learning with a second set of samples for each state. The blue line is much closer to zero, indicating less bias. The three rows correspond to different true functions (left, purple) or capacities of the fitted function (left, green). (Details in the text)

Double DQN

The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation. Although not fully decoupled, the target network in the DQN architecture provides a natural candidate for the second value function, without having to introduce additional networks. We therefore propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value. In reference to both Double Q-learning and DQN, we refer to the resulting algorithm as Double DQN. Its update is the same as for DQN, but replacing the target $Y_t^{DQN}$ with

$$Y_t^{DoubleDQN} \equiv R_{t+1} + \gamma\, Q(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t), \theta_t^-).$$

In comparison to Double Q-learning (4), the weights of the second network $\theta'_t$ are replaced with the weights of the target network $\theta_t^-$ for the evaluation of the current greedy policy. The update to the target network stays unchanged from DQN, and remains a periodic copy of the online network.

This version of Double DQN is perhaps the minimal possible change to DQN towards Double Q-learning.
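The difference between the DQN target and the Double DQN target can be sketched for a small batch of transitions as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the array layout (one row per transition, one column per action), and the omission of any terminal-state handling are choices made here, and the action values are supplied as plain arrays rather than computed by a network.

```python
import numpy as np


def dqn_target(reward, gamma, q_target_next):
    """DQN-style target: the target network both selects and evaluates
    the action in the next state."""
    return reward + gamma * q_target_next.max(axis=1)


def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """Double-DQN-style target: the online network selects the greedy
    action, the target network evaluates it."""
    greedy = q_online_next.argmax(axis=1)        # selection (online weights)
    rows = np.arange(q_target_next.shape[0])
    return reward + gamma * q_target_next[rows, greedy]  # evaluation (target weights)


if __name__ == "__main__":
    # Made-up numbers for two transitions and three actions.
    reward = np.array([1.0, 0.0])
    q_online_next = np.array([[1.0, 3.0, 2.0], [0.0, 0.0, 5.0]])
    q_target_next = np.array([[4.0, 2.0, 6.0], [1.0, 7.0, 3.0]])
    print(dqn_target(reward, 0.5, q_target_next))                        # [4.  3.5]
    print(double_dqn_target(reward, 0.5, q_online_next, q_target_next))  # [2.  1.5]
```

Because the evaluated action is one particular column of the target network's output, the second target can never exceed the first for the same inputs, which is one way to see why the change tends to lower the estimates.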
The goal is to get most of the benefit of Double Q-learning, while keeping the rest of the DQN algorithm intact for a fair comparison, and with minimal computational overhead.

Empirical results

In this section, we analyze the overestimations of DQN and show that Double DQN improves over DQN both in terms of value accuracy and in terms of policy quality. To further test the robustness of the approach we additionally evaluate the algorithms with random starts generated from expert human trajectories, as proposed by Nair et al. (2015).

Our testbed consists of Atari 2600 games, using the Arcade Learning Environment (Bellemare et al. 2013). The goal is for a single algorithm, with a fixed set of hyperparameters, to learn to play each of the games separately from interaction given only the screen pixels as input. This is a demanding testbed: not only are the inputs high-dimensional, the game visuals and game mechanics vary substantially between games. Good solutions must therefore rely heavily on the learning algorithm; it is not practically feasible to overfit the domain by relying only on tuning.

We closely follow the experimental setup and network architecture used by Mnih et al. (2015). Briefly, the network architecture is a convolutional neural network (Fukushima 1988, LeCun et al. 1998) with 3 convolution layers and a fully-connected hidden layer (approximately 1.5M parameters in total). The network takes the last four frames as input and outputs the action value of each action. On each game, the network is trained on a single GPU for 200M frames.

Results on overoptimism

Figure 3 shows examples of DQN's overestimations in six Atari games. DQN and Double DQN were both trained under the exact conditions described by Mnih et al. (2015). DQN is consistently and sometimes vastly overoptimistic about the value of the current greedy policy, as can be seen by comparing the orange learning curves in the top row of plots to the straight orange lines, which represent the actual discounted value of the best learned policy. More precisely, the (averaged) value estimates are computed regularly during training with full evaluation phases of length $T = 125{,}000$ steps as

$$\frac{1}{T} \sum_{t=1}^{T} \max_a Q(S_t, a; \theta).$$
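To make the comparison between estimated and actual values concrete, here is a minimal sketch of the two quantities involved: the average of the maximal action value over the visited states, and the discounted return actually obtained from each visited state. The function names and the single-episode, array-based interface are assumptions made for illustration; the paper does not describe its evaluation code.

```python
import numpy as np


def mean_max_q(q_values):
    """Average over visited states of max_a Q(S_t, a).
    q_values has shape (T, m): one row per step, one column per action."""
    return q_values.max(axis=1).mean()


def discounted_returns(rewards, gamma):
    """Discounted return from every step of one episode, computed backwards:
    G_t = r_t + gamma * G_{t+1}, with G = 0 after the last step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Made-up numbers: three steps, two actions.
    q = np.array([[1.0, 2.0], [3.0, 0.0], [0.5, 1.0]])
    g = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
    print(mean_max_q(q), g.mean())  # estimate vs. realized value
```

A positive gap between the first number and the second, measured for the same policy, is the kind of overestimation the learning curves in this section are meant to expose.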

[Figure 3 plots. Top row: value estimates versus training steps (in millions) for Alien, Space Invaders, Time Pilot, and Zaxxon, with curves for the DQN estimate, the Double DQN estimate, the DQN true value, and the Double DQN true value. Middle row: value estimates (log scale) for Wizard of Wor and Asterix. Bottom row: score for Wizard of Wor and Asterix.]

Figure 3: The top and middle rows show value estimates by DQN (orange) and Double DQN (blue) on six Atari games. The results are obtained by running DQN and Double DQN with 6 different random seeds with the hyper-parameters employed by Mnih et al. (2015). The darker line shows the median over seeds and we average the two extreme values to obtain the shaded area (i.e., 10% and 90% quantiles with linear interpolation). The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded, and averaging the actual discounted return obtained from each visited state. These straight lines would match the learning curves at the right side of the plots if there is no bias. The middle row shows the value estimates (in log scale) for two games in which DQN's overoptimism is quite extreme. The bottom row shows the detrimental effect of this on the score achieved by the agent as it is evaluated during training: the scores drop when the overestimations begin. Learning with Double DQN is much more stable.

The ground truth averaged values are obtained by running the best learned policies for several episodes and computing the actual cumulative rewards. Without overestimations we would expect these quantities to match up (i.e., the curve to match the straight line at the right of each plot). Instead, the learning curves of DQN consistently end up much higher than the true values. The learning curves for Double DQN, shown in blue, are much closer to the blue straight line representing the true value of the final policy. Note that the blue straight line is often higher than the orange straight line. This indicates that Double DQN does not just produce more accurate value estimates but also better policies.

More extreme overestimations are shown in the middle two plots, where DQN is highly unstable on the games Asterix and Wizard of Wor. Notice the log scale for the values on the y-axis. The bottom two plots show the corresponding scores for these two games.
Notice that the increases in value estimates for DQN in the middle plots coincide with decreasing scores in bottom plots. Again, this indicates that the overestimations are harming the quality of the resulting policies. If seen in isolation, one might perhaps be tempted to think the observed instability is related to inherent instability problems of off-policy learning with function approximation (Baird 1995, Tsitsiklis and Van Roy 1997, Maei 2011, Sutton et al. 2015). However, we see that learning is much more stable with Double DQN, suggesting that the cause for these instabilities is in fact Q-learning's overoptimism. Figure 3 only shows a few examples, but overestimations were observed for DQN in all 49 tested Atari games, albeit in varying amounts.

                 no ops              human starts
            DQN      DDQN       DQN      DDQN     DDQN (tuned)
Median      93%      115%       47%      88%      117%
Mean        241%     330%       122%     273%     475%

Table 1: Summarized normalized performance on 49 games for up to 5 minutes with up to 30 no ops at the start of each episode, and for up to 30 minutes with randomly selected human start points. Results for DQN are from Mnih et al. (2015) (no ops) and Nair et al. (2015) (human starts).

Quality of the learned policies

Overoptimism does not always adversely affect the quality of the learned policy. For example, DQN achieves optimal behavior in Pong despite slightly overestimating the policy value. Nevertheless, reducing overestimations can significantly benefit the stability of learning; we see clear examples of this in Figure 3. We now assess more generally how much Double DQN helps in terms of policy quality by evaluating on all 49 games that DQN was tested on.

As described by Mnih et al. (2015) each evaluation episode starts by executing a special no-op action that does not affect the environment up to 30 times, to provide different starting points for the agent. Some exploration during evaluation provides additional randomization. For Double DQN we used the exact same hyper-parameters as for DQN, to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an $\epsilon$-greedy policy where $\epsilon = 0.05$. The scores are averaged over 100 episodes. The only difference between Double DQN and DQN is the target, using $Y_t^{DoubleDQN}$ rather than $Y_t^{DQN}$. This evaluation is somewhat adversarial, as the used hyperparameters were tuned for DQN but not for Double DQN.

To obtain summary statistics across games, we normalize the score for each game as follows:

$$\text{score}_{\text{normalized}} = \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}}. \qquad (5)$$

The random and human scores are the same as used by Mnih et al. (2015), and are given in the appendix.

Table 1, under no ops, shows that on the whole Double DQN clearly improves over DQN. A detailed comparison (in appendix) shows that there are several games in which Double DQN greatly improves upon DQN. Noteworthy examples include Road Runner (from 233% to 617%), Asterix (from 70% to 180%), Zaxxon (from 54% to 111%), and Double Dunk (from 17% to 397%).

The Gorila algorithm (Nair et al. 2015), which is a massively distributed version of DQN, is not included in the table because the architecture and infrastructure is sufficiently different to make a direct comparison unclear. For completeness, we note that Gorila obtained median and mean normalized scores of 96% and 495%, respectively.

Robustness to Human starts

One concern with the previous evaluation is that in deterministic games with a unique starting point the learner could potentially learn to remember sequences of actions without much need to generalize. While successful, the solution would not be particularly robust.
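The score normalization of equation (5) and the median and mean summaries reported in Table 1 can be sketched as follows. The function names and the example numbers are made up for illustration and are not scores from the paper; the multiplication by 100 is an assumption made here so that the result reads as a percentage, as in the table.

```python
import statistics


def normalized_score(agent, random, human):
    """Equation (5) expressed as a percentage: 0 matches the random
    policy, 100 matches the human reference score."""
    return 100.0 * (agent - random) / (human - random)


def summarize(normalized_scores):
    """Median and mean of per-game normalized scores."""
    return statistics.median(normalized_scores), statistics.mean(normalized_scores)


if __name__ == "__main__":
    # Hypothetical (agent, random, human) raw scores for three games.
    games = [(60.0, 10.0, 110.0), (160.0, 10.0, 110.0), (410.0, 10.0, 110.0)]
    per_game = [normalized_score(*g) for g in games]
    print(per_game)             # [50.0, 150.0, 400.0]
    print(summarize(per_game))  # (150.0, 200.0)
```

The example also shows why both statistics are reported: a single game with a very large normalized score pulls the mean well above the median.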
By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned policies (Nair et al. 2015).

We obtained 100 starting points sampled for each game from a human expert's trajectory, as proposed by Nair et al. (2015). We start an evaluation episode from each of these starting points and run the emulator for up to 108,000 frames (30 mins at 60Hz including the trajectory before the starting point). Each agent is only evaluated on the rewards accumulated after the starting point.

For this evaluation we include a tuned version of Double DQN. Some tuning is appropriate because the hyperparameters were tuned for DQN, which is a different algorithm. For the tuned version of Double DQN, we increased the number of frames between each two copies of the target network from 10,000 to 30,000, to reduce overestimations further because immediately after each switch DQN and Double DQN both revert to Q-learning. In addition, we reduced the exploration during learning from $\epsilon = 0.1$ to $\epsilon = 0.01$, and then used $\epsilon = 0.001$ during evaluation. Finally, the tuned version uses a single shared bias for all action values in the top layer of the network. Each of these changes improved performance and together they result in clearly better results [3].

[Figure 4 plot. Horizontal bar chart of normalized score per game, with bars for Human, Double DQN, Double DQN (tuned), and DQN.]

Figure 4: Normalized scores on 57 Atari games, tested for 100 episodes per game with human starts. Compared to Mnih et al. (2015), eight additional games were tested. These are indicated with stars and a bold font.

Table 1 reports summary statistics for this evaluation (under human starts) on the 49 games from Mnih et al. (2015).
Double DQN obtains clearly higher median and mean scores. Again Gorila (Nair et al. 2015) is not included in the table, but for completeness note it obtained a median of 78% and a mean of 259%. Detailed results, plus results for an additional 8 games, are available in Figure 4 and in the appendix. On several games the improvements from DQN to Double DQN are striking, in some cases bringing scores much closer to human, or even surpassing these.

[3] Except for Tennis, where the lower $\epsilon$ during training seemed to hurt rather than help.

Double DQN appears more robust to this more challenging evaluation, suggesting that appropriate generalizations occur and that the found solutions do not exploit the determinism of the environments. This is appealing, as it indicates progress towards finding general solutions rather than a deterministic sequence of steps that would be less robust.

Discussion

This paper has five contributions. First, we have shown why Q-learning can be overoptimistic in large-scale problems, even if these are deterministic, due to the inherent estimation errors of learning. Second, by analyzing the value estimates on Atari games we have shown that these overestimations are more common and severe in practice than previously acknowledged. Third, we have shown that Double Q-learning can be used at scale to successfully reduce this overoptimism, resulting in more stable and reliable learning. Fourth, we have proposed a specific implementation called Double DQN, that uses the existing architecture and deep neural network of the DQN algorithm without requiring additional networks or parameters. Finally, we have shown that Double DQN finds better policies, obtaining new state-of-the-art results on the Atari 2600 domain.

Acknowledgments

We would like to thank Tom Schaul, Volodymyr Mnih, Marc Bellemare, Thomas Degris, Georg Ostrovski, and Richard Sutton for helpful comments, and everyone at Google DeepMind for a constructive research environment.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054-1078, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235-256, 2002.

L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, pages 30-37, 1995.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR), 47:253-279, 2013.

R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213-231, 2003.

K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks, 1(2):119-130, 1988.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

L. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3):293-321, 1992.

H. R. Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.

A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning. Springer, 2005.

B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063-1088, 2004.

A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. The Journal of Machine Learning Research, 10:2413-2444, 2009.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9-44, 1988.

R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning, pages 216-224, 1990.

R. S. Sutton and A. G. Barto. Introduction to reinforcement learning. MIT Press, 1998.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv preprint arXiv:1503.04269, 2015.

I. Szita and A. Lőrincz. The many faces of optimism: a unifying approach. In Proceedings of the 25th international conference on Machine learning. ACM, 2008.

G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613-2621, 2010.

H. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011.

C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
