Deep Reinforcement Learning with Double Q-Learning


Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Hado van Hasselt, Arthur Guez, and David Silver
Google DeepMind

Abstract

The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

The goal of reinforcement learning (Sutton and Barto 1998) is to learn good policies for sequential decision problems, by optimizing a cumulative future reward signal. Q-learning (Watkins 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values.

In previous work, overestimations have been attributed to insufficiently flexible function approximation (Thrun and Schwartz 1993) and noise (van Hasselt 2010, 2011). In this paper, we unify these views and show overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error. Of course, imprecise value estimates are the norm during learning, which indicates that overestimations may be much more common than previously appreciated.

It is an open question whether, if the overestimations do occur, this negatively affects performance in practice. Overoptimistic value estimates are not necessarily a problem in and of themselves.
If all values were uniformly higher, the relative action preferences would be preserved and we would not expect the resulting policy to be any worse. Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique (Kaelbling et al. 1996). If, however, the overestimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. Thrun and Schwartz (1993) give specific examples in which this leads to suboptimal policies, even asymptotically.

To test whether overestimations occur in practice and at scale, we investigate the performance of the recent DQN algorithm (Mnih et al. 2015). DQN combines Q-learning with a flexible deep neural network and was tested on a varied and large set of deterministic Atari 2600 games, reaching human-level performance on many games. In some ways, this setting is a best-case scenario for Q-learning, because the deep neural network provides flexible function approximation with the potential for a low asymptotic approximation error, and the determinism of the environments prevents the harmful effects of noise. Perhaps surprisingly, we show that even in this comparatively favorable setting DQN sometimes substantially overestimates the values of the actions.

We show that the Double Q-learning algorithm (van Hasselt 2010), which was first proposed in a tabular setting, can be generalized to arbitrary function approximation, including deep neural networks. We use this to construct a new algorithm called Double DQN. This algorithm not only yields more accurate value estimates, but leads to much higher scores on several games. This demonstrates that the overestimations of DQN indeed lead to poorer policies and that it is beneficial to reduce them. In addition, by improving upon DQN we obtain state-of-the-art results on the Atari domain.

Copyright © 2016, Association for the Advancement of Artificial Intelligence. All rights reserved.

Background

To solve sequential decision problems we can learn estimates for the optimal value of each action, defined as the expected sum of future rewards when taking that action and following the optimal policy thereafter. Under a given policy $\pi$, the true value of an action $a$ in a state $s$ is

$$Q_\pi(s, a) \equiv \mathbb{E}\left[ R_1 + \gamma R_2 + \ldots \mid S_0 = s, A_0 = a, \pi \right],$$

where $\gamma \in [0, 1]$ is a discount factor that trades off the importance of immediate and later rewards. The optimal value is then $Q_*(s, a) = \max_\pi Q_\pi(s, a)$. An optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state.
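As a small illustration of these two definitions, the sketch below computes the discounted return of one sampled reward sequence and reads the greedy action off a list of action values. The function names and the toy numbers are assumptions made for this example; they are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return R_1 + gamma * R_2 + gamma^2 * R_3 + ... for one reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G <- R + gamma * G.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(action_values):
    """Index of the highest-valued action (ties go to the lowest index)."""
    return max(range(len(action_values)), key=lambda a: action_values[a])


# Toy example with hypothetical numbers.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(greedy_action([0.2, 1.3, 0.7]))                 # action 1 has the largest value
```

The true value $Q_\pi(s, a)$ is the expectation of such returns over trajectories, so a single return is only one sample of it.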

Estimates for the optimal action values can be learned using Q-learning (Watkins 1989), a form of temporal difference learning (Sutton 1988). Most interesting problems are too large to learn all action values in all states separately. Instead, we can learn a parameterized value function $Q(s, a; \theta_t)$. The standard Q-learning update for the parameters after taking action $A_t$ in state $S_t$ and observing the immediate reward $R_{t+1}$ and resulting state $S_{t+1}$ is then

$$\theta_{t+1} = \theta_t + \alpha \left( Y^Q_t - Q(S_t, A_t; \theta_t) \right) \nabla_{\theta_t} Q(S_t, A_t; \theta_t), \tag{1}$$

where $\alpha$ is a scalar step size and the target $Y^Q_t$ is defined as

$$Y^Q_t \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t). \tag{2}$$

This update resembles stochastic gradient descent, updating the current value $Q(S_t, A_t; \theta_t)$ towards a target value $Y^Q_t$.

Deep Q Networks

A deep Q network (DQN) is a multi-layered neural network that for a given state $s$ outputs a vector of action values $Q(s, \cdot\,; \theta)$, where $\theta$ are the parameters of the network. For an $n$-dimensional state space and an action space containing $m$ actions, the neural network is a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Two important ingredients of the DQN algorithm as proposed by Mnih et al. (2015) are the use of a target network, and the use of experience replay. The target network, with parameters $\theta^-$, is the same as the online network except that its parameters are copied every $\tau$ steps from the online network, so that then $\theta^-_t = \theta_t$, and kept fixed on all other steps. The target used by DQN is then

$$Y^{DQN}_t \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta^-_t). \tag{3}$$

For the experience replay (Lin 1992), observed transitions are stored for some time and sampled uniformly from this memory bank to update the network. Both the target network and the experience replay dramatically improve the performance of the algorithm (Mnih et al. 2015).

Double Q-learning

The max operator in standard Q-learning and DQN, in (2) and (3), uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation. In Double Q-learning (van Hasselt 2010), two value functions are learned by assigning experiences randomly to update one of the two value functions, resulting in two sets of weights, $\theta$ and $\theta'$. For each update, one set of weights is used to determine the greedy policy and the other to determine its value.
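The update (1), the targets (2) and (3), the periodic target-network copy, and uniform experience replay can be sketched together as follows. This is a minimal sketch under stated assumptions, not the DQN implementation: it uses a linear value function instead of a deep network, and the toy transition generator, the feature vectors, and all names (`q_values`, `sgd_update`, `replay`, and so on) are hypothetical.

```python
import random
from collections import deque


def q_values(theta, features):
    """Linear action values: Q(s, a; theta) = theta[a] . features(s)."""
    return [sum(w * x for w, x in zip(row, features)) for row in theta]


def q_learning_target(reward, next_features, theta, gamma):
    """Equation (2): bootstrap from the online parameters theta."""
    return reward + gamma * max(q_values(theta, next_features))


def dqn_target(reward, next_features, theta_target, gamma):
    """Equation (3): same form, but bootstrapping from the target parameters."""
    return reward + gamma * max(q_values(theta_target, next_features))


def sgd_update(theta, features, action, target, alpha):
    """Equation (1) for a linear Q: the gradient w.r.t. theta[action] is features."""
    td_error = target - q_values(theta, features)[action]
    theta[action] = [w + alpha * td_error * x for w, x in zip(theta[action], features)]


# Hypothetical toy problem: 2 state features, 3 actions, random transitions.
theta = [[0.0, 0.0] for _ in range(3)]
theta_target = [row[:] for row in theta]
replay = deque(maxlen=1000)  # experience replay memory
tau, gamma, alpha = 10, 0.9, 0.1
rng = random.Random(0)

for step in range(1, 101):
    s = [rng.random(), 1.0]
    a = rng.randrange(3)
    r = 1.0 if a == 0 else 0.0
    s_next = [rng.random(), 1.0]
    replay.append((s, a, r, s_next))

    # Sample a stored transition uniformly and update the online parameters.
    fs, fa, fr, fs_next = rng.choice(replay)
    y = dqn_target(fr, fs_next, theta_target, gamma)
    sgd_update(theta, fs, fa, y, alpha)

    # Copy the online parameters into the target network every tau steps.
    if step % tau == 0:
        theta_target = [row[:] for row in theta]
```

Swapping `dqn_target` for `q_learning_target(fr, fs_next, theta, gamma)` in the loop gives the plain Q-learning update, which makes the only difference between the two targets explicit: which parameters the bootstrap value is read from.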
For a clear comparison, we can untangle the selection and evaluation in Q-learning and rewrite its target (2) as

$$Y^Q_t = R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta_t\right).$$

The Double Q-learning target can then be written as

$$Y^{DoubleQ}_t \equiv R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta'_t\right). \tag{4}$$

Notice that the selection of the action, in the argmax, is still due to the online weights $\theta_t$. This means that, as in Q-learning, we are still estimating the value of the greedy policy according to the current values, as defined by $\theta_t$. However, we use the second set of weights $\theta'_t$ to fairly evaluate the value of this policy. This second set of weights can be updated symmetrically by switching the roles of $\theta$ and $\theta'$.

Overoptimism due to estimation errors

Q-learning's overestimations were first investigated by Thrun and Schwartz (1993), who showed that if the action values contain random errors uniformly distributed in an interval $[-\epsilon, \epsilon]$ then each target is overestimated up to $\gamma \epsilon \frac{m-1}{m+1}$, where $m$ is the number of actions. In addition, Thrun and Schwartz give a concrete example in which these overestimations even asymptotically lead to sub-optimal policies, and show the overestimations manifest themselves in a small toy problem when using function approximation. Van Hasselt (2010) noted that noise in the environment can lead to overestimations even when using tabular representation, and proposed Double Q-learning as a solution.

In this section we demonstrate more generally that estimation errors of any kind can induce an upward bias, regardless of whether these errors are due to environmental noise, function approximation, non-stationarity, or any other source. This is important, because in practice any method will incur some inaccuracies during learning, simply due to the fact that the true values are initially unknown.

The result by Thrun and Schwartz (1993) cited above gives an upper bound to the overestimation for a specific setup, but it is also possible, and potentially more interesting, to derive a lower bound.

Theorem 1. Consider a state $s$ in which all the true optimal action values are equal at $Q_*(s, a) = V_*(s)$ for some $V_*(s)$. Let $Q_t$ be arbitrary value estimates that are on the whole unbiased in the sense that $\sum_a \left( Q_t(s, a) - V_*(s) \right) = 0$, but that are not all correct, such that $\frac{1}{m} \sum_a \left( Q_t(s, a) - V_*(s) \right)^2 = C$ for some $C > 0$, where $m \geq 2$ is the number of actions in $s$. Under these conditions,

$$\max_a Q_t(s, a) \geq V_*(s) + \sqrt{\frac{C}{m-1}}.$$

This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero. (Proof in appendix.)

Note that we did not need to assume that estimation errors for different actions are independent. This theorem shows that even if the value estimates are on average correct, estimation errors of any source can drive the estimates up and away from the true optimal values.

The lower bound in Theorem 1 decreases with the number of actions. This is an artifact of considering the lower bound, which requires very specific values to be attained. More typically, the overoptimism increases with the number of actions as shown in Figure 1. Q-learning's overestimations there indeed increase with the number of actions, while Double Q-learning is unbiased. As another example, if for all actions $Q_*(s, a) = V_*(s)$ and the estimation errors $Q_t(s, a) - V_*(s)$ are uniformly random in $[-1, 1]$, then the overoptimism is $\frac{m-1}{m+1}$. (Proof in appendix.)
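Both claims can be checked numerically. The sketch below first builds one error configuration that satisfies the conditions of Theorem 1 and attains the bound, then estimates by Monte Carlo the bias of the single and double estimators when the errors are uniform in $[-1, 1]$. The specific tight configuration ($m - 1$ errors of $\sqrt{C/(m-1)}$ and one error of $-\sqrt{(m-1)C}$), the sample sizes, and all function names are assumptions of this example; the true value is set to $V_*(s) = 0$ for convenience.

```python
import math
import random


def tight_errors(m, c):
    """Errors that satisfy Theorem 1's conditions and attain the lower bound."""
    small = math.sqrt(c / (m - 1))
    return [small] * (m - 1) + [-math.sqrt((m - 1) * c)]


def single_and_double_bias(m, trials, rng):
    """Monte Carlo bias of max_a Q and of the double estimator, with V_*(s) = 0.

    Errors are uniform in [-1, 1]; q and q2 are independent sets of estimates.
    """
    single = 0.0
    double = 0.0
    for _ in range(trials):
        q = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        q2 = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        best = max(range(m), key=lambda a: q[a])
        single += q[best]   # select and evaluate with the same estimates
        double += q2[best]  # select with q, evaluate with q2
    return single / trials, double / trials


num_actions, c = 5, 2.0
errors = tight_errors(num_actions, c)
print(sum(errors))                                    # close to 0: unbiased on the whole
print(sum(e * e for e in errors) / num_actions)       # close to C
print(max(errors), math.sqrt(c / (num_actions - 1)))  # the maximum equals the bound

rng = random.Random(0)
for m in (2, 10, 50):
    single, double = single_and_double_bias(m, 5000, rng)
    # Columns: m, simulated bias of max, (m - 1) / (m + 1), simulated double bias.
    print(m, round(single, 3), round((m - 1) / (m + 1), 3), round(double, 3))
```

The simulated bias of the maximum should track $(m-1)/(m+1)$ and grow with $m$, while the double estimator stays near zero because the action is selected and evaluated with independent estimates.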

Figure 1: [Bar chart omitted; x-axis: number of actions, y-axis: error; series: $\max_a Q(s, a) - V_*(s)$ and $Q'(s, \arg\max_a Q(s, a)) - V_*(s)$.] The orange bars show the bias in a single Q-learning update when the action values are $Q(s, a) = V_*(s) + \epsilon_a$ and the errors $\{\epsilon_a\}_{a=1}^m$ are independent standard normal random variables. The second set of action values $Q'$, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.

We now turn to function approximation and consider a real-valued continuous state space with 10 discrete actions in each state. For simplicity, the true optimal action values in this example depend only on state so that in each state all actions have the same true value. These true values are shown in the left column of plots in Figure 2 (purple lines) and are defined as either $Q_*(s, a) = \sin(s)$ (top row) or $Q_*(s, a) = 2 \exp(-s^2)$ (middle and bottom rows). The left plots also show an approximation for a single action (green lines) as a function of state as well as the samples the estimate is based on (green dots). The estimate is a $d$-degree polynomial that is fit to the true values at sampled states, where $d = 6$ (top and middle rows) or $d = 9$ (bottom row). The samples match the true function exactly: there is no noise and we assume we have ground truth for the action value on these sampled states. The approximation is inexact even on the sampled states for the top two rows because the function approximation is insufficiently flexible. In the bottom row, the function is flexible enough to fit the green dots, but this reduces the accuracy in unsampled states. Notice that the sampled states are spaced further apart near the left side of the left plots, resulting in larger estimation errors. In many ways this is a typical learning setting, where at each point in time we only have limited data.

The middle column of plots in Figure 2 shows estimated action values for all 10 actions (green lines), as functions of state, along with the maximum action value in each state (black dashed line). Although the true value function is the same for all actions, the approximations differ because they are based on different sets of sampled states (footnote 1). The maximum is often higher than the ground truth shown in purple on the left. This is confirmed in the right plots, which show the difference between the black and purple curves in orange.
The orange line is almost always positive, indicating an upward bias. The right plots also show the estimates from Double Q-learning in blue (footnote 2), which are on average much closer to zero. This demonstrates that Double Q-learning indeed can successfully reduce the overoptimism of Q-learning.

Footnote 1: Each action-value function is fit with a different subset of integer states. States $-6$ and $6$ are always included to avoid extrapolations, and for each action two adjacent integers are missing: for action $a_1$ states $-5$ and $-4$ are not sampled, for $a_2$ states $-4$ and $-3$ are not sampled, and so on. This causes the estimated values to differ.

The different rows in Figure 2 show variations of the same experiment. The difference between the top and middle rows is the true value function, demonstrating that overestimations are not an artifact of a specific true value function. The difference between the middle and bottom rows is the flexibility of the function approximation. In the left-middle plot, the estimates are even incorrect for some of the sampled states because the function is insufficiently flexible. The function in the bottom-left plot is more flexible but this causes higher estimation errors for unseen states, resulting in higher overestimations. This is important because flexible parametric function approximation is often employed in reinforcement learning (see, e.g., Tesauro 1995, Sallans and Hinton 2004, Riedmiller 2005, and Mnih et al. 2015).

In contrast to van Hasselt (2010), we did not use a statistical argument to find overestimations; the process to obtain Figure 2 is fully deterministic. In contrast to Thrun and Schwartz (1993), we did not rely on inflexible function approximation with irreducible asymptotic errors; the bottom row shows that a function that is flexible enough to cover all samples leads to high overestimations. This indicates that the overestimations can occur quite generally.

In the examples above, overestimations occur even when assuming we have samples of the true action value at certain states. The value estimates can further deteriorate if we bootstrap off of action values that are already overoptimistic, since this causes overestimations to propagate throughout our estimates. Although uniformly overestimating values might not hurt the resulting policy, in practice overestimation errors will differ for different states and actions.
Overestimation combined with bootstrapping then has the pernicious effect of propagating the wrong relative information about which states are more valuable than others, directly affecting the quality of the learned policies.

The overestimations should not be confused with optimism in the face of uncertainty (Sutton 1990, Agrawal 1995, Kaelbling et al. 1996, Auer et al. 2002, Brafman and Tennenholtz 2003, Szita and Lőrincz 2008, Strehl, Li, and Littman 2009), where an exploration bonus is given to states or actions with uncertain values. The overestimations discussed here occur only after updating, resulting in overoptimism in the face of apparent certainty. Thrun and Schwartz (1993) noted that, in contrast to optimism in the face of uncertainty, these overestimations actually can impede learning an optimal policy. We confirm this negative effect on policy quality in our experiments: when we reduce the overestimations using Double Q-learning, the policies improve.

Footnote 2: We arbitrarily used the samples of action $a_{i+5}$ (for $i \leq 5$) or $a_{i-5}$ (for $i > 5$) as the second set of samples for the double estimator of action $a_i$.

Figure 2: [Three rows of plots omitted; columns: "True value and an estimate", "All estimates and max", "Bias as function of state"; x-axis: state, from -6 to 6; the right column also reports the average error of each estimate.] Illustration of overestimations during learning. In each state (x-axis), there are 10 actions. The left column shows the true values $V_*(s)$ (purple line). All true action values are defined by $Q_*(s, a) = V_*(s)$. The green line shows estimated values $Q(s, a)$ for one action as a function of state, fitted to the true value at several sampled states (green dots). The middle column plots show all the estimated values (green), and the maximum of these values (dashed black). The maximum is higher than the true value (purple, left plot) almost everywhere. The right column plots show the difference in orange. The blue line in the right plots is the estimate used by Double Q-learning with a second set of samples for each state. The blue line is much closer to zero, indicating less bias. The three rows correspond to different true functions (left, purple) or capacities of the fitted function (left, green). (Details in the text.)

Double DQN

The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation. Although not fully decoupled, the target network in the DQN architecture provides a natural candidate for the second value function, without having to introduce additional networks. We therefore propose to evaluate the greedy policy according to the online network, but using the target network to estimate its value. In reference to both Double Q-learning and DQN, we refer to the resulting algorithm as Double DQN. Its update is the same as for DQN, but replacing the target $Y^{DQN}_t$ with

$$Y^{DoubleDQN}_t \equiv R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta^-_t\right).$$

In comparison to Double Q-learning (4), the weights of the second network $\theta'_t$ are replaced with the weights of the target network $\theta^-_t$ for the evaluation of the current greedy policy. The update to the target network stays unchanged from DQN, and remains a periodic copy of the online network.

This version of Double DQN is perhaps the minimal possible change to DQN towards Double Q-learning.
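The difference between the DQN target and the Double DQN target can be made concrete for a single transition. The sketch below assumes the action values of the next state have already been computed by the online and target networks; the function names and the numbers are hypothetical, and the handling of terminal states is left out because the text does not discuss it.

```python
def argmax(values):
    """Index of the largest entry (ties go to the lowest index)."""
    return max(range(len(values)), key=lambda a: values[a])


def dqn_target(reward, q_next_target, gamma):
    """Y^DQN: select and evaluate the next action with the target network."""
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, q_next_online, q_next_target, gamma):
    """Y^DoubleDQN: select with the online network, evaluate with the target network."""
    best_action = argmax(q_next_online)
    return reward + gamma * q_next_target[best_action]


# Hypothetical action values for one next state S_{t+1} with three actions.
q_next_online = [1.0, 2.0, 0.5]  # Q(S_{t+1}, . ; theta_t)
q_next_target = [1.5, 1.0, 3.0]  # Q(S_{t+1}, . ; theta_t^-)

print(dqn_target(0.0, q_next_target, gamma=0.5))                        # 0 + 0.5 * 3.0 = 1.5
print(double_dqn_target(0.0, q_next_online, q_next_target, gamma=0.5))  # 0 + 0.5 * 1.0 = 0.5
```

Because the evaluated action need not be the target network's own maximizer, the Double DQN target is never larger than the DQN target for the same transition when $\gamma \geq 0$.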
The goal is to get most of the benefit of Double Q-learning, while keeping the rest of the DQN algorithm intact for a fair comparison, and with minimal computational overhead.

Empirical results

In this section, we analyze the overestimations of DQN and show that Double DQN improves over DQN both in terms of value accuracy and in terms of policy quality. To further test the robustness of the approach we additionally evaluate the algorithms with random starts generated from expert human trajectories, as proposed by Nair et al. (2015).

Our testbed consists of Atari 2600 games, using the Arcade Learning Environment (Bellemare et al. 2013). The goal is for a single algorithm, with a fixed set of hyperparameters, to learn to play each of the games separately from interaction given only the screen pixels as input. This is a demanding testbed: not only are the inputs high-dimensional, the game visuals and game mechanics vary substantially between games. Good solutions must therefore rely heavily on the learning algorithm; it is not practically feasible to overfit the domain by relying only on tuning.

We closely follow the experimental setup and network architecture used by Mnih et al. (2015). Briefly, the network architecture is a convolutional neural network (Fukushima 1988, LeCun et al. 1998) with 3 convolution layers and a fully-connected hidden layer (approximately 1.5M parameters in total). The network takes the last four frames as input and outputs the action value of each action. On each game, the network is trained on a single GPU for 200M frames.

Results on overoptimism

Figure 3 shows examples of DQN's overestimations in six Atari games. DQN and Double DQN were both trained under the exact conditions described by Mnih et al. (2015). DQN is consistently and sometimes vastly overoptimistic about the value of the current greedy policy, as can be seen by comparing the orange learning curves in the top row of plots to the straight orange lines, which represent the actual discounted value of the best learned policy. More precisely, the (averaged) value estimates are computed regularly during training with full evaluation phases of length $T = 125{,}000$ steps as

$$\frac{1}{T} \sum_{t=1}^{T} \max_a Q(S_t, a; \theta).$$
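The comparison between value estimates and realized returns can be sketched as below, reading the averaged value estimate as the mean of the maximal action value over the visited states. The data layout (one list of action values per visited state, one reward per step of a single episode), the function names, and the numbers are assumptions of this example rather than details from the paper.

```python
def average_value_estimate(q_values_per_state):
    """(1/T) * sum_t max_a Q(S_t, a; theta) over the T visited states."""
    return sum(max(q) for q in q_values_per_state) / len(q_values_per_state)


def average_discounted_return(rewards, gamma):
    """Average, over visited states, of the discounted return that followed each state.

    rewards[t] is the reward received after leaving the t-th visited state of
    one episode.
    """
    returns = []
    g = 0.0
    # Walk backwards so each state's return reuses the return of its successor.
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    return sum(returns) / len(returns)


# Hypothetical evaluation data: three visited states, two actions each.
q_values_per_state = [[1.0, 3.0], [2.0, 0.0], [0.5, 1.5]]
rewards = [0.0, 0.0, 1.0]

estimate = average_value_estimate(q_values_per_state)   # (3.0 + 2.0 + 1.5) / 3
actual = average_discounted_return(rewards, gamma=0.5)  # (0.25 + 0.5 + 1.0) / 3
print(estimate - actual)  # positive here: the estimates exceed the realized returns
```

A positive gap of this kind, measured for the greedy policy of the same network, is what an overestimation looks like in practice.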

Figure 3: [Plots omitted; top row: value estimates on Alien, Space Invaders, Time Pilot, and Zaxxon; middle row: value estimates (log scale) on Wizard of Wor and Asterix; bottom row: score on Wizard of Wor and Asterix; x-axis: training steps (in millions); legend: DQN estimate, Double DQN estimate, DQN true value, Double DQN true value.] The top and middle rows show value estimates by DQN (orange) and Double DQN (blue) on six Atari games. The results are obtained by running DQN and Double DQN with 6 different random seeds with the hyper-parameters employed by Mnih et al. (2015). The darker line shows the median over seeds and we average the two extreme values to obtain the shaded area (i.e., 10% and 90% quantiles with linear interpolation). The straight horizontal orange (for DQN) and blue (for Double DQN) lines in the top row are computed by running the corresponding agents after learning concluded, and averaging the actual discounted return obtained from each visited state. These straight lines would match the learning curves at the right side of the plots if there is no bias. The middle row shows the value estimates (in log scale) for two games in which DQN's overoptimism is quite extreme. The bottom row shows the detrimental effect of this on the score achieved by the agent as it is evaluated during training: the scores drop when the overestimations begin. Learning with Double DQN is much more stable.

The ground truth averaged values are obtained by running the best learned policies for several episodes and computing the actual cumulative rewards. Without overestimations we would expect these quantities to match up (i.e., the curve to match the straight line at the right of each plot). Instead, the learning curves of DQN consistently end up much higher than the true values. The learning curves for Double DQN, shown in blue, are much closer to the blue straight line representing the true value of the final policy. Note that the blue straight line is often higher than the orange straight line. This indicates that Double DQN does not just produce more accurate value estimates but also better policies.

More extreme overestimations are shown in the middle two plots, where DQN is highly unstable on the games Asterix and Wizard of Wor. Notice the log scale for the values on the y-axis. The bottom two plots show the corresponding scores for these two games.
Notice that the increases in value estimates for DQN in the middle plots coincide with decreasing scores in the bottom plots. Again, this indicates that the overestimations are harming the quality of the resulting policies. If seen in isolation, one might perhaps be tempted to think the observed instability is related to inherent instability problems of off-policy learning with function approximation (Baird 1995, Tsitsiklis and Van Roy 1997, Maei 2011, Sutton et al. 2015). However, we see that learning is much more stable with Double DQN, suggesting that the cause for these instabilities is in fact Q-learning's overoptimism. Figure 3 only shows a few examples, but overestimations were observed for DQN in all 49 tested Atari games, albeit in varying amounts.

                no ops               human starts
          DQN      DDQN        DQN      DDQN     DDQN (tuned)
Median    93%      115%        47%      88%      117%
Mean      241%     330%        122%     273%     475%

Table 1: Summarized normalized performance on 49 games for up to 5 minutes with up to 30 no ops at the start of each episode, and for up to 30 minutes with randomly selected human start points. Results for DQN are from Mnih et al. (2015) (no ops) and Nair et al. (2015) (human starts).

Quality of the learned policies

Overoptimism does not always adversely affect the quality of the learned policy. For example, DQN achieves optimal behavior in Pong despite slightly overestimating the policy value. Nevertheless, reducing overestimations can significantly benefit the stability of learning; we see clear examples of this in Figure 3. We now assess more generally how much Double DQN helps in terms of policy quality by evaluating on all 49 games that DQN was tested on.

As described by Mnih et al. (2015) each evaluation episode starts by executing a special no-op action that does not affect the environment up to 30 times, to provide different starting points for the agent. Some exploration during evaluation provides additional randomization. For Double DQN we used the exact same hyper-parameters as for DQN, to allow for a controlled experiment focused just on reducing overestimations. The learned policies are evaluated for 5 mins of emulator time (18,000 frames) with an $\epsilon$-greedy policy where $\epsilon = 0.05$. The scores are averaged over 100 episodes. The only difference between Double DQN and DQN is the target, using $Y^{DoubleDQN}_t$ rather than $Y^{DQN}_t$. This evaluation is somewhat adversarial, as the used hyperparameters were tuned for DQN but not for Double DQN.

To obtain summary statistics across games, we normalize the score for each game as follows:

$$\text{score}_{\text{normalized}} = \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}}. \tag{5}$$

The random and human scores are the same as used by Mnih et al. (2015), and are given in the appendix.

Table 1, under no ops, shows that on the whole Double DQN clearly improves over DQN. A detailed comparison (in appendix) shows that there are several games in which Double DQN greatly improves upon DQN. Noteworthy examples include Road Runner (from 233% to 617%), Asterix (from 70% to 180%), Zaxxon (from 54% to 111%), and Double Dunk (from 17% to 397%).

The Gorila algorithm (Nair et al. 2015), which is a massively distributed version of DQN, is not included in the table because the architecture and infrastructure are sufficiently different to make a direct comparison unclear. For completeness, we note that Gorila obtained median and mean normalized scores of 96% and 495%, respectively.

Robustness to Human starts

One concern with the previous evaluation is that in deterministic games with a unique starting point the learner could potentially learn to remember sequences of actions without much need to generalize. While successful, the solution would not be particularly robust.
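Returning briefly to the normalization in equation (5): it and the median and mean summaries reported in Table 1 can be sketched in a few lines. The game names and raw scores below are made up for illustration and are not results from the paper; only the formula follows the text.

```python
from statistics import mean, median


def normalized_score(agent, random_score, human):
    """Equation (5): 0 corresponds to random play and 1 (100%) to the human score."""
    return (agent - random_score) / (human - random_score)


# Hypothetical raw scores for three games: (agent, random, human).
raw = {
    "game_a": (50.0, 10.0, 110.0),
    "game_b": (300.0, 100.0, 200.0),
    "game_c": (5.0, 0.0, 20.0),
}
normalized = {name: normalized_score(*scores) for name, scores in raw.items()}
values = list(normalized.values())

print(normalized)      # game_a: 0.4, game_b: 2.0, game_c: 0.25
print(median(values))  # 0.4
print(mean(values))
```

A single game with a very high normalized score, such as the hypothetical `game_b`, pulls the mean well above the median, which is one reason to report both.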
By testing the agents from various starting points, we can test whether the found solutions generalize well, and as such provide a challenging testbed for the learned policies (Nair et al. 2015).

We obtained 100 starting points sampled for each game from a human expert's trajectory, as proposed by Nair et al. (2015). We start an evaluation episode from each of these starting points and run the emulator for up to 108,000 frames (30 mins at 60Hz including the trajectory before the starting point). Each agent is only evaluated on the rewards accumulated after the starting point.

For this evaluation we include a tuned version of Double DQN. Some tuning is appropriate because the hyperparameters were tuned for DQN, which is a different algorithm. For the tuned version of Double DQN, we increased the number of frames between each two copies of the target network from 10,000 to 30,000, to reduce overestimations further because immediately after each switch DQN and Double DQN both revert to Q-learning. In addition, we reduced the exploration during learning from $\epsilon = 0.1$ to $\epsilon = 0.01$, and then used $\epsilon = 0.001$ during evaluation. Finally, the tuned version uses a single shared bias for all action values in the top layer of the network. Each of these changes improved performance and together they result in clearly better results (footnote 3).

Figure 4: [Bar chart omitted; one bar group per game, x-axis: normalized score; series: DQN, Double DQN, Double DQN (tuned); reference line: Human.] Normalized scores on 57 Atari games, tested for 100 episodes per game with human starts. Compared to Mnih et al. (2015), eight additional games were tested. These are indicated with stars and a bold font.

Table 1 reports summary statistics for this evaluation (under human starts) on the 49 games from Mnih et al. (2015).
Double DQN obtains clearly higher median and mean scores. Again Gorila (Nair et al. 2015) is not included in the table, but for completeness note it obtained a median of 78% and a mean of 259%. Detailed results, plus results for an additional 8 games, are available in Figure 4 and in the appendix. On several games the improvements from DQN to Double DQN are striking, in some cases bringing scores much closer to human, or even surpassing these.

Footnote 3: Except for Tennis, where the lower $\epsilon$ during training seemed to hurt rather than help.

Double DQN appears more robust to this more challenging evaluation, suggesting that appropriate generalizations occur and that the found solutions do not exploit the determinism of the environments. This is appealing, as it indicates progress towards finding general solutions rather than a deterministic sequence of steps that would be less robust.

Discussion

This paper has five contributions. First, we have shown why Q-learning can be overoptimistic in large-scale problems, even if these are deterministic, due to the inherent estimation errors of learning. Second, by analyzing the value estimates on Atari games we have shown that these overestimations are more common and severe in practice than previously acknowledged. Third, we have shown that Double Q-learning can be used at scale to successfully reduce this overoptimism, resulting in more stable and reliable learning. Fourth, we have proposed a specific implementation called Double DQN, that uses the existing architecture and deep neural network of the DQN algorithm without requiring additional networks or parameters. Finally, we have shown that Double DQN finds better policies, obtaining new state-of-the-art results on the Atari 2600 domain.

Acknowledgments

We would like to thank Tom Schaul, Volodymyr Mnih, Marc Bellemare, Thomas Degris, Georg Ostrovski, and Richard Sutton for helpful comments, and everyone at Google DeepMind for a constructive research environment.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054-1078, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235-256, 2002.

L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, pages 30-37, 1995.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR), 47:253-279, 2013.

R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213-231, 2003.

K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural networks, 1(2):119-130, 1988.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

L. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3):293-321, 1992.

H. R. Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Proceedings of the 16th European Conference on Machine Learning, pages 317-328. Springer, 2005.

B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063-1088, 2004.

A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. The Journal of Machine Learning Research, 10:2413-2444, 2009.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9-44, 1988.

R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning, pages 216-224, 1990.

R. S. Sutton and A. G. Barto. Introduction to reinforcement learning. MIT Press, 1998.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv preprint arXiv:1503.04269, 2015.

I. Szita and A. Lőrincz. The many faces of optimism: a unifying approach. In Proceedings of the 25th international conference on Machine learning, pages 1048-1055. ACM, 2008.

G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613-2621, 2010.

H. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011.

C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.


More information

A new model for limit order book dynamics

A new model for limit order book dynamics Anewmodelforlimiorderbookdynmics JeffreyR.Russell UniversiyofChicgo,GrdueSchoolofBusiness TejinKim UniversiyofChicgo,DeprmenofSisics Absrc:Thispperproposesnewmodelforlimiorderbookdynmics.Thelimiorderbookconsiss

More information

The solution is often represented as a vector: 2xI + 4X2 + 2X3 + 4X4 + 2X5 = 4 2xI + 4X2 + 3X3 + 3X4 + 3X5 = 4. 3xI + 6X2 + 6X3 + 3X4 + 6X5 = 6.

The solution is often represented as a vector: 2xI + 4X2 + 2X3 + 4X4 + 2X5 = 4 2xI + 4X2 + 3X3 + 3X4 + 3X5 = 4. 3xI + 6X2 + 6X3 + 3X4 + 6X5 = 6. [~ o o :- o o ill] i 1. Mrices, Vecors, nd Guss-Jordn Eliminion 1 x y = = - z= The soluion is ofen represened s vecor: n his exmple, he process of eliminion works very smoohly. We cn elimine ll enries

More information

f t f a f x dx By Lin McMullin f x dx= f b f a. 2

f t f a f x dx By Lin McMullin f x dx= f b f a. 2 Accumulion: Thoughs On () By Lin McMullin f f f d = + The gols of he AP* Clculus progrm include he semen, Sudens should undersnd he definie inegrl s he ne ccumulion of chnge. 1 The Topicl Ouline includes

More information

Probability, Estimators, and Stationarity

Probability, Estimators, and Stationarity Chper Probbiliy, Esimors, nd Sionriy Consider signl genered by dynmicl process, R, R. Considering s funcion of ime, we re opering in he ime domin. A fundmenl wy o chrcerize he dynmics using he ime domin

More information

1 jordan.mcd Eigenvalue-eigenvector approach to solving first order ODEs. -- Jordan normal (canonical) form. Instructor: Nam Sun Wang

1 jordan.mcd Eigenvalue-eigenvector approach to solving first order ODEs. -- Jordan normal (canonical) form. Instructor: Nam Sun Wang jordnmcd Eigenvlue-eigenvecor pproch o solving firs order ODEs -- ordn norml (cnonicl) form Insrucor: Nm Sun Wng Consider he following se of coupled firs order ODEs d d x x 5 x x d d x d d x x x 5 x x

More information

S Radio transmission and network access Exercise 1-2

S Radio transmission and network access Exercise 1-2 S-7.330 Rdio rnsmission nd nework ccess Exercise 1 - P1 In four-symbol digil sysem wih eqully probble symbols he pulses in he figure re used in rnsmission over AWGN-chnnel. s () s () s () s () 1 3 4 )

More information

Physics 2A HW #3 Solutions

Physics 2A HW #3 Solutions Chper 3 Focus on Conceps: 3, 4, 6, 9 Problems: 9, 9, 3, 41, 66, 7, 75, 77 Phsics A HW #3 Soluions Focus On Conceps 3-3 (c) The ccelerion due o grvi is he sme for boh blls, despie he fc h he hve differen

More information

( ) ( ) ( ) ( ) ( ) ( y )

( ) ( ) ( ) ( ) ( ) ( y ) 8. Lengh of Plne Curve The mos fmous heorem in ll of mhemics is he Pyhgoren Theorem. I s formulion s he disnce formul is used o find he lenghs of line segmens in he coordine plne. In his secion you ll

More information

Making Complex Decisions Markov Decision Processes. Making Complex Decisions: Markov Decision Problem

Making Complex Decisions Markov Decision Processes. Making Complex Decisions: Markov Decision Problem Mking Comple Decisions Mrkov Decision Processes Vsn Honvr Bioinformics nd Compuionl Biology Progrm Cener for Compuionl Inelligence, Lerning, & Discovery honvr@cs.ise.edu www.cs.ise.edu/~honvr/ www.cild.ise.edu/

More information

An integral having either an infinite limit of integration or an unbounded integrand is called improper. Here are two examples.

An integral having either an infinite limit of integration or an unbounded integrand is called improper. Here are two examples. Improper Inegrls To his poin we hve only considered inegrls f(x) wih he is of inegrion nd b finie nd he inegrnd f(x) bounded (nd in fc coninuous excep possibly for finiely mny jump disconinuiies) An inegrl

More information

MATH 124 AND 125 FINAL EXAM REVIEW PACKET (Revised spring 2008)

MATH 124 AND 125 FINAL EXAM REVIEW PACKET (Revised spring 2008) MATH 14 AND 15 FINAL EXAM REVIEW PACKET (Revised spring 8) The following quesions cn be used s review for Mh 14/ 15 These quesions re no cul smples of quesions h will pper on he finl em, bu hey will provide

More information

Solutions to Problems from Chapter 2

Solutions to Problems from Chapter 2 Soluions o Problems rom Chper Problem. The signls u() :5sgn(), u () :5sgn(), nd u h () :5sgn() re ploed respecively in Figures.,b,c. Noe h u h () :5sgn() :5; 8 including, bu u () :5sgn() is undeined..5

More information

Reinforcement Learning

Reinforcement Learning Reiforceme Corol lerig Corol polices h choose opiml cios Q lerig Covergece Chper 13 Reiforceme 1 Corol Cosider lerig o choose cios, e.g., Robo lerig o dock o bery chrger o choose cios o opimize fcory oupu

More information

Mathematics 805 Final Examination Answers

Mathematics 805 Final Examination Answers . 5 poins Se he Weiersrss M-es. Mhemics 85 Finl Eminion Answers Answer: Suppose h A R, nd f n : A R. Suppose furher h f n M n for ll A, nd h Mn converges. Then f n converges uniformly on A.. 5 poins Se

More information

Neural assembly binding in linguistic representation

Neural assembly binding in linguistic representation Neurl ssembly binding in linguisic represenion Frnk vn der Velde & Mrc de Kmps Cogniive Psychology Uni, Universiy of Leiden, Wssenrseweg 52, 2333 AK Leiden, The Neherlnds, vdvelde@fsw.leidenuniv.nl Absrc.

More information

REAL ANALYSIS I HOMEWORK 3. Chapter 1

REAL ANALYSIS I HOMEWORK 3. Chapter 1 REAL ANALYSIS I HOMEWORK 3 CİHAN BAHRAN The quesions re from Sein nd Shkrchi s e. Chper 1 18. Prove he following sserion: Every mesurble funcion is he limi.e. of sequence of coninuous funcions. We firs

More information

Convergence of Singular Integral Operators in Weighted Lebesgue Spaces

Convergence of Singular Integral Operators in Weighted Lebesgue Spaces EUROPEAN JOURNAL OF PURE AND APPLIED MATHEMATICS Vol. 10, No. 2, 2017, 335-347 ISSN 1307-5543 www.ejpm.com Published by New York Business Globl Convergence of Singulr Inegrl Operors in Weighed Lebesgue

More information

PHYSICS 1210 Exam 1 University of Wyoming 14 February points

PHYSICS 1210 Exam 1 University of Wyoming 14 February points PHYSICS 1210 Em 1 Uniersiy of Wyoming 14 Februry 2013 150 poins This es is open-noe nd closed-book. Clculors re permied bu compuers re no. No collborion, consulion, or communicion wih oher people (oher

More information

5.1-The Initial-Value Problems For Ordinary Differential Equations

5.1-The Initial-Value Problems For Ordinary Differential Equations 5.-The Iniil-Vlue Problems For Ordinry Differenil Equions Consider solving iniil-vlue problems for ordinry differenil equions: (*) y f, y, b, y. If we know he generl soluion y of he ordinry differenil

More information

1.0 Electrical Systems

1.0 Electrical Systems . Elecricl Sysems The ypes of dynmicl sysems we will e sudying cn e modeled in erms of lgeric equions, differenil equions, or inegrl equions. We will egin y looking fmilir mhemicl models of idel resisors,

More information

A 1.3 m 2.5 m 2.8 m. x = m m = 8400 m. y = 4900 m 3200 m = 1700 m

A 1.3 m 2.5 m 2.8 m. x = m m = 8400 m. y = 4900 m 3200 m = 1700 m PHYS : Soluions o Chper 3 Home Work. SSM REASONING The displcemen is ecor drwn from he iniil posiion o he finl posiion. The mgniude of he displcemen is he shores disnce beween he posiions. Noe h i is onl

More information

MTH 146 Class 11 Notes

MTH 146 Class 11 Notes 8.- Are of Surfce of Revoluion MTH 6 Clss Noes Suppose we wish o revolve curve C round n is nd find he surfce re of he resuling solid. Suppose f( ) is nonnegive funcion wih coninuous firs derivive on he

More information

1. Introduction. 1 b b

1. Introduction. 1 b b Journl of Mhemicl Inequliies Volume, Number 3 (007), 45 436 SOME IMPROVEMENTS OF GRÜSS TYPE INEQUALITY N. ELEZOVIĆ, LJ. MARANGUNIĆ AND J. PEČARIĆ (communiced b A. Čižmešij) Absrc. In his pper some inequliies

More information

T-Match: Matching Techniques For Driving Yagi-Uda Antennas: T-Match. 2a s. Z in. (Sections 9.5 & 9.7 of Balanis)

T-Match: Matching Techniques For Driving Yagi-Uda Antennas: T-Match. 2a s. Z in. (Sections 9.5 & 9.7 of Balanis) 3/0/018 _mch.doc Pge 1 of 6 T-Mch: Mching Techniques For Driving Ygi-Ud Anenns: T-Mch (Secions 9.5 & 9.7 of Blnis) l s l / l / in The T-Mch is shun-mching echnique h cn be used o feed he driven elemen

More information

A LIMIT-POINT CRITERION FOR A SECOND-ORDER LINEAR DIFFERENTIAL OPERATOR IAN KNOWLES

A LIMIT-POINT CRITERION FOR A SECOND-ORDER LINEAR DIFFERENTIAL OPERATOR IAN KNOWLES A LIMIT-POINT CRITERION FOR A SECOND-ORDER LINEAR DIFFERENTIAL OPERATOR j IAN KNOWLES 1. Inroducion Consider he forml differenil operor T defined by el, (1) where he funcion q{) is rel-vlued nd loclly

More information

MAT 266 Calculus for Engineers II Notes on Chapter 6 Professor: John Quigg Semester: spring 2017

MAT 266 Calculus for Engineers II Notes on Chapter 6 Professor: John Quigg Semester: spring 2017 MAT 66 Clculus for Engineers II Noes on Chper 6 Professor: John Quigg Semeser: spring 7 Secion 6.: Inegrion by prs The Produc Rule is d d f()g() = f()g () + f ()g() Tking indefinie inegrls gives [f()g

More information

Reinforcement learning

Reinforcement learning CS 75 Mchine Lening Lecue b einfocemen lening Milos Huskech milos@cs.pi.edu 539 Senno Sque einfocemen lening We wn o len conol policy: : X A We see emples of bu oupus e no given Insed of we ge feedbck

More information

On Source and Channel Codes for Multiple Inputs and Outputs: Does Multiple Description Meet Space Time? 1

On Source and Channel Codes for Multiple Inputs and Outputs: Does Multiple Description Meet Space Time? 1 On Source nd Chnnel Codes for Muliple Inpus nd Oupus: oes Muliple escripion Mee Spce Time? Michelle Effros Rlf Koeer 3 Andre J. Goldsmih 4 Muriel Médrd 5 ep. of Elecricl Eng., 36-93, Cliforni Insiue of

More information

A Time Truncated Improved Group Sampling Plans for Rayleigh and Log - Logistic Distributions

A Time Truncated Improved Group Sampling Plans for Rayleigh and Log - Logistic Distributions ISSNOnline : 39-8753 ISSN Prin : 347-67 An ISO 397: 7 Cerified Orgnizion Vol. 5, Issue 5, My 6 A Time Trunced Improved Group Smpling Plns for Ryleigh nd og - ogisic Disribuions P.Kvipriy, A.R. Sudmni Rmswmy

More information

THREE IMPORTANT CONCEPTS IN TIME SERIES ANALYSIS: STATIONARITY, CROSSING RATES, AND THE WOLD REPRESENTATION THEOREM

THREE IMPORTANT CONCEPTS IN TIME SERIES ANALYSIS: STATIONARITY, CROSSING RATES, AND THE WOLD REPRESENTATION THEOREM THR IMPORTANT CONCPTS IN TIM SRIS ANALYSIS: STATIONARITY, CROSSING RATS, AND TH WOLD RPRSNTATION THORM Prof. Thoms B. Fomb Deprmen of conomics Souhern Mehodis Universi June 8 I. Definiion of Covrince Sionri

More information

Chapter Direct Method of Interpolation

Chapter Direct Method of Interpolation Chper 5. Direc Mehod of Inerpolion Afer reding his chper, you should be ble o:. pply he direc mehod of inerpolion,. sole problems using he direc mehod of inerpolion, nd. use he direc mehod inerpolns o

More information

1. Consider a PSA initially at rest in the beginning of the left-hand end of a long ISS corridor. Assume xo = 0 on the left end of the ISS corridor.

1. Consider a PSA initially at rest in the beginning of the left-hand end of a long ISS corridor. Assume xo = 0 on the left end of the ISS corridor. In Eercise 1, use sndrd recngulr Cresin coordine sysem. Le ime be represened long he horizonl is. Assume ll ccelerions nd decelerions re consn. 1. Consider PSA iniilly res in he beginning of he lef-hnd

More information

Average & instantaneous velocity and acceleration Motion with constant acceleration

Average & instantaneous velocity and acceleration Motion with constant acceleration Physics 7: Lecure Reminders Discussion nd Lb secions sr meeing ne week Fill ou Pink dd/drop form if you need o swich o differen secion h is FULL. Do i TODAY. Homework Ch. : 5, 7,, 3,, nd 6 Ch.: 6,, 3 Submission

More information

Factorized Decision Forecasting via Combining Value-based and Reward-based Estimation

Factorized Decision Forecasting via Combining Value-based and Reward-based Estimation Fcorized Decision Forecsing vi Combining Vlue-bsed nd Rewrd-bsed Esimion Brin D. Ziebr Crnegie Mellon Universiy Pisburgh, PA 15213 bziebr@cs.cmu.edu Absrc A powerful recen perspecive for predicing sequenil

More information

Reinforcement Learning. Markov Decision Processes

Reinforcement Learning. Markov Decision Processes einforcemen Lerning Mrkov Decision rocesses Mnfred Huber 2014 1 equenil Decision Mking N-rmed bi problems re no good wy o model sequenil decision problem Only dels wih sic decision sequences Could be miiged

More information

22.615, MHD Theory of Fusion Systems Prof. Freidberg Lecture 9: The High Beta Tokamak

22.615, MHD Theory of Fusion Systems Prof. Freidberg Lecture 9: The High Beta Tokamak .65, MHD Theory of Fusion Sysems Prof. Freidberg Lecure 9: The High e Tokmk Summry of he Properies of n Ohmic Tokmk. Advnges:. good euilibrium (smll shif) b. good sbiliy ( ) c. good confinemen ( τ nr )

More information

Version 001 test-1 swinney (57010) 1. is constant at m/s.

Version 001 test-1 swinney (57010) 1. is constant at m/s. Version 001 es-1 swinne (57010) 1 This prin-ou should hve 20 quesions. Muliple-choice quesions m coninue on he nex column or pge find ll choices before nswering. CubeUniVec1x76 001 10.0 poins Acubeis1.4fee

More information

Think of the Relationship Between Time and Space Again

Think of the Relationship Between Time and Space Again Repor nd Opinion, 1(3),009 hp://wwwsciencepubne sciencepub@gmilcom Think of he Relionship Beween Time nd Spce Agin Yng F-cheng Compny of Ruid Cenre in Xinjing 15 Hongxing Sree, Klmyi, Xingjing 834000,

More information

Math 2142 Exam 1 Review Problems. x 2 + f (0) 3! for the 3rd Taylor polynomial at x = 0. To calculate the various quantities:

Math 2142 Exam 1 Review Problems. x 2 + f (0) 3! for the 3rd Taylor polynomial at x = 0. To calculate the various quantities: Mah 4 Eam Review Problems Problem. Calculae he 3rd Taylor polynomial for arcsin a =. Soluion. Le f() = arcsin. For his problem, we use he formula f() + f () + f ()! + f () 3! for he 3rd Taylor polynomial

More information

Removing Redundancy and Inconsistency in Memory- Based Collaborative Filtering

Removing Redundancy and Inconsistency in Memory- Based Collaborative Filtering Removing Redundncy nd Inconsisency in Memory- Bsed Collborive Filering Ki Yu Siemens AG, Corpore Technology & Universiy of Munich, Germny ki.yu.exernl@mchp.siemens. de Xiowei Xu Informion Science Deprmen

More information

One Practical Algorithm for Both Stochastic and Adversarial Bandits

One Practical Algorithm for Both Stochastic and Adversarial Bandits One Prcicl Algorihm for Boh Sochsic nd Adversril Bndis Full Version Including Appendices Yevgeny Seldin Queenslnd Universiy of Technology, Brisbne, Ausrli Aleksndrs Slivkins Microsof Reserch, New York

More information

SOME USEFUL MATHEMATICS

SOME USEFUL MATHEMATICS SOME USEFU MAHEMAICS SOME USEFU MAHEMAICS I is esy o mesure n preic he behvior of n elecricl circui h conins only c volges n currens. However, mos useful elecricl signls h crry informion vry wih ime. Since

More information

INVESTIGATION OF REINFORCEMENT LEARNING FOR BUILDING THERMAL MASS CONTROL

INVESTIGATION OF REINFORCEMENT LEARNING FOR BUILDING THERMAL MASS CONTROL INVESTIGATION OF REINFORCEMENT LEARNING FOR BUILDING THERMAL MASS CONTROL Simeng Liu nd Gregor P. Henze, Ph.D., P.E. Universiy of Nebrsk Lincoln, Archiecurl Engineering 1110 Souh 67 h Sree, Peer Kiewi

More information

Magnetostatics Bar Magnet. Magnetostatics Oersted s Experiment

Magnetostatics Bar Magnet. Magnetostatics Oersted s Experiment Mgneosics Br Mgne As fr bck s 4500 yers go, he Chinese discovered h cerin ypes of iron ore could rc ech oher nd cerin mels. Iron filings "mp" of br mgne s field Crefully suspended slivers of his mel were

More information

AJAE appendix for Is Exchange Rate Pass-Through in Pork Meat Export Prices Constrained by the Supply of Live Hogs?

AJAE appendix for Is Exchange Rate Pass-Through in Pork Meat Export Prices Constrained by the Supply of Live Hogs? AJAE ppendix for Is Exchnge Re Pss-Through in Por Me Expor Prices Consrined by he Supply of Live Hogs? Jen-Philippe Gervis Cnd Reserch Chir in Agri-indusries nd Inernionl Trde Cener for Reserch in he Economics

More information

INTEGRALS. Exercise 1. Let f : [a, b] R be bounded, and let P and Q be partitions of [a, b]. Prove that if P Q then U(P ) U(Q) and L(P ) L(Q).

INTEGRALS. Exercise 1. Let f : [a, b] R be bounded, and let P and Q be partitions of [a, b]. Prove that if P Q then U(P ) U(Q) and L(P ) L(Q). INTEGRALS JOHN QUIGG Eercise. Le f : [, b] R be bounded, nd le P nd Q be priions of [, b]. Prove h if P Q hen U(P ) U(Q) nd L(P ) L(Q). Soluion: Le P = {,..., n }. Since Q is obined from P by dding finiely

More information

Procedia Computer Science

Procedia Computer Science Procedi Compuer Science 00 (0) 000 000 Procedi Compuer Science www.elsevier.com/loce/procedi The Third Informion Sysems Inernionl Conference The Exisence of Polynomil Soluion of he Nonliner Dynmicl Sysems

More information

Some basic notation and terminology. Deterministic Finite Automata. COMP218: Decision, Computation and Language Note 1

Some basic notation and terminology. Deterministic Finite Automata. COMP218: Decision, Computation and Language Note 1 COMP28: Decision, Compuion nd Lnguge Noe These noes re inended minly s supplemen o he lecures nd exooks; hey will e useful for reminders ou noion nd erminology. Some sic noion nd erminology An lphe is

More information

Optimality of Myopic Policy for a Class of Monotone Affine Restless Multi-Armed Bandit

Optimality of Myopic Policy for a Class of Monotone Affine Restless Multi-Armed Bandit Univeriy of Souhern Cliforni Opimliy of Myopic Policy for Cl of Monoone Affine Rele Muli-Armed Bndi Pri Mnourifrd USC Tr Jvidi UCSD Bhkr Krihnmchri USC Dec 0, 202 Univeriy of Souhern Cliforni Inroducion

More information

Nonlinear System Modelling: How to Estimate the. Highest Significant Order

Nonlinear System Modelling: How to Estimate the. Highest Significant Order IEEE Insrumenion nd Mesuremen Technology Conference nchorge,, US, - My Nonliner Sysem Modelling: ow o Esime he ighes Significn Order Neophyos Chirs, Ceri Evns nd Dvid Rees, Michel Solomou School of Elecronics,

More information

M r. d 2. R t a M. Structural Mechanics Section. Exam CT5141 Theory of Elasticity Friday 31 October 2003, 9:00 12:00 hours. Problem 1 (3 points)

M r. d 2. R t a M. Structural Mechanics Section. Exam CT5141 Theory of Elasticity Friday 31 October 2003, 9:00 12:00 hours. Problem 1 (3 points) Delf Universiy of Technology Fculy of Civil Engineering nd Geosciences Srucurl echnics Secion Wrie your nme nd sudy numer he op righ-hnd of your work. Exm CT5 Theory of Elsiciy Fridy Ocoer 00, 9:00 :00

More information

Chapter 2. Motion along a straight line. 9/9/2015 Physics 218

Chapter 2. Motion along a straight line. 9/9/2015 Physics 218 Chper Moion long srigh line 9/9/05 Physics 8 Gols for Chper How o describe srigh line moion in erms of displcemen nd erge elociy. The mening of insnneous elociy nd speed. Aerge elociy/insnneous elociy

More information

P441 Analytical Mechanics - I. Coupled Oscillators. c Alex R. Dzierba

P441 Analytical Mechanics - I. Coupled Oscillators. c Alex R. Dzierba Lecure 3 Mondy - Deceber 5, 005 Wrien or ls upded: Deceber 3, 005 P44 Anlyicl Mechnics - I oupled Oscillors c Alex R. Dzierb oupled oscillors - rix echnique In Figure we show n exple of wo coupled oscillors,

More information

How to Prove the Riemann Hypothesis Author: Fayez Fok Al Adeh.

How to Prove the Riemann Hypothesis Author: Fayez Fok Al Adeh. How o Prove he Riemnn Hohesis Auhor: Fez Fok Al Adeh. Presiden of he Srin Cosmologicl Socie P.O.Bo,387,Dmscus,Sri Tels:963--77679,735 Emil:hf@scs-ne.org Commens: 3 ges Subj-Clss: Funcionl nlsis, comle

More information

Honours Introductory Maths Course 2011 Integration, Differential and Difference Equations

Honours Introductory Maths Course 2011 Integration, Differential and Difference Equations Honours Inroducory Mhs Course 0 Inegrion, Differenil nd Difference Equions Reding: Ching Chper 4 Noe: These noes do no fully cover he meril in Ching, u re men o supplemen your reding in Ching. Thus fr

More information

Tracking Error Performance of Tracking Filters Based on IMM for Threatening Target to Navel Vessel

Tracking Error Performance of Tracking Filters Based on IMM for Threatening Target to Navel Vessel 456 Inernionl Journl Te of Conrol, Hyun Fng Auomion, nd Je nd Weon Sysems, Choi vol. 5, no. 4, pp. 456-46, Augus 7 Trcing Error Performnce of Trcing Filers Bsed on for Threening Trge o Nvel Vessel Te Hyun

More information

Estimating the population parameter, r, q and K based on surplus production model. Wang, Chien-Hsiung

Estimating the population parameter, r, q and K based on surplus production model. Wang, Chien-Hsiung SCTB15 Working Pper ALB 7 Esiming he populion prmeer, r, q nd K bsed on surplus producion model Wng, Chien-Hsiung Biologicl nd Fishery Division Insiue of Ocenogrphy Nionl Tiwn Universiy Tipei, Tiwn Tile:

More information

Transforms II - Wavelets Preliminary version please report errors, typos, and suggestions for improvements

Transforms II - Wavelets Preliminary version please report errors, typos, and suggestions for improvements EECS 3 Digil Signl Processing Universiy of Cliforni, Berkeley: Fll 007 Gspr November 4, 007 Trnsforms II - Wveles Preliminry version plese repor errors, ypos, nd suggesions for improvemens We follow n

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Lerning Tom Mitchell, Mchine Lerning, chpter 13 Outline Introduction Comprison with inductive lerning Mrkov Decision Processes: the model Optiml policy: The tsk Q Lerning: Q function Algorithm

More information

Question Details Int Vocab 1 [ ] Question Details Int Vocab 2 [ ]

Question Details Int Vocab 1 [ ] Question Details Int Vocab 2 [ ] /3/5 Assignmen Previewer 3 Bsic: Definie Inegrls (67795) Due: Wed Apr 5 5 9: AM MDT Quesion 3 5 6 7 8 9 3 5 6 7 8 9 3 5 6 Insrucions Red ody's Noes nd Lerning Gols. Quesion Deils In Vocb [37897] The chnge

More information

CBSE 2014 ANNUAL EXAMINATION ALL INDIA

CBSE 2014 ANNUAL EXAMINATION ALL INDIA CBSE ANNUAL EXAMINATION ALL INDIA SET Wih Complee Eplnions M Mrks : SECTION A Q If R = {(, y) : + y = 8} is relion on N, wrie he rnge of R Sol Since + y = 8 h implies, y = (8 ) R = {(, ), (, ), (6, )}

More information

Echocardiography Project and Finite Fourier Series

Echocardiography Project and Finite Fourier Series Echocardiography Projec and Finie Fourier Series 1 U M An echocardiagram is a plo of how a porion of he hear moves as he funcion of ime over he one or more hearbea cycles If he hearbea repeas iself every

More information

3D Transformations. Computer Graphics COMP 770 (236) Spring Instructor: Brandon Lloyd 1/26/07 1

3D Transformations. Computer Graphics COMP 770 (236) Spring Instructor: Brandon Lloyd 1/26/07 1 D Trnsformions Compuer Grphics COMP 770 (6) Spring 007 Insrucor: Brndon Lloyd /6/07 Geomery Geomeric eniies, such s poins in spce, exis wihou numers. Coordines re nming scheme. The sme poin cn e descried

More information

EXISTENCE AND UNIQUENESS OF SOLUTIONS FOR A SECOND-ORDER ITERATIVE BOUNDARY-VALUE PROBLEM

EXISTENCE AND UNIQUENESS OF SOLUTIONS FOR A SECOND-ORDER ITERATIVE BOUNDARY-VALUE PROBLEM Elecronic Journl of Differenil Equions, Vol. 208 (208), No. 50, pp. 6. ISSN: 072-669. URL: hp://ejde.mh.xse.edu or hp://ejde.mh.un.edu EXISTENCE AND UNIQUENESS OF SOLUTIONS FOR A SECOND-ORDER ITERATIVE

More information

RESPONSE UNDER A GENERAL PERIODIC FORCE. When the external force F(t) is periodic with periodτ = 2π

RESPONSE UNDER A GENERAL PERIODIC FORCE. When the external force F(t) is periodic with periodτ = 2π RESPONSE UNDER A GENERAL PERIODIC FORCE When he exernl force F() is periodic wih periodτ / ω,i cn be expnded in Fourier series F( ) o α ω α b ω () where τ F( ) ω d, τ,,,... () nd b τ F( ) ω d, τ,,... (3)

More information

graph of unit step function t

graph of unit step function t .5 Piecewie coninuou forcing funcion...e.g. urning he forcing on nd off. The following Lplce rnform meril i ueful in yem where we urn forcing funcion on nd off, nd when we hve righ hnd ide "forcing funcion"

More information

Bipartite Matching. Matching. Bipartite Matching. Maxflow Formulation

Bipartite Matching. Matching. Bipartite Matching. Maxflow Formulation Mching Inpu: undireced grph G = (V, E). Biprie Mching Inpu: undireced, biprie grph G = (, E).. Mching Ern Myr, Hrld äcke Biprie Mching Inpu: undireced, biprie grph G = (, E). Mflow Formulion Inpu: undireced,

More information

Introduction to LoggerPro

Introduction to LoggerPro Inroducion o LoggerPro Sr/Sop collecion Define zero Se d collecion prmeers Auoscle D Browser Open file Sensor seup window To sr d collecion, click he green Collec buon on he ool br. There is dely of second

More information

Some Inequalities variations on a common theme Lecture I, UL 2007

Some Inequalities variations on a common theme Lecture I, UL 2007 Some Inequliies vriions on common heme Lecure I, UL 2007 Finbrr Hollnd, Deprmen of Mhemics, Universiy College Cork, fhollnd@uccie; July 2, 2007 Three Problems Problem Assume i, b i, c i, i =, 2, 3 re rel

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

How to prove the Riemann Hypothesis

How to prove the Riemann Hypothesis Scholrs Journl of Phsics, Mhemics nd Sisics Sch. J. Phs. Mh. S. 5; (B:5-6 Scholrs Acdemic nd Scienific Publishers (SAS Publishers (An Inernionl Publisher for Acdemic nd Scienific Resources *Corresonding

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

15/03/1439. Lecture 4: Linear Time Invariant (LTI) systems

15/03/1439. Lecture 4: Linear Time Invariant (LTI) systems Lecre 4: Liner Time Invrin LTI sysems 2. Liner sysems, Convolion 3 lecres: Implse response, inp signls s coninm of implses. Convolion, discree-ime nd coninos-ime. LTI sysems nd convolion Specific objecives

More information

TIMELINESS, ACCURACY, AND RELEVANCE IN DYNAMIC INCENTIVE CONTRACTS

TIMELINESS, ACCURACY, AND RELEVANCE IN DYNAMIC INCENTIVE CONTRACTS TIMELINESS, ACCURACY, AND RELEVANCE IN DYNAMIC INCENTIVE CONTRACTS by Peer O. Chrisensen Universiy of Souhern Denmrk Odense, Denmrk Gerld A. Felhm Universiy of Briish Columbi Vncouver, Cnd Chrisin Hofmnn

More information

Flow Networks Alon Efrat Slides courtesy of Charles Leiserson with small changes by Carola Wenk. Flow networks. Flow networks CS 445

Flow Networks Alon Efrat Slides courtesy of Charles Leiserson with small changes by Carola Wenk. Flow networks. Flow networks CS 445 CS 445 Flow Nework lon Efr Slide corey of Chrle Leieron wih mll chnge by Crol Wenk Flow nework Definiion. flow nework i direced grph G = (V, E) wih wo diingihed erice: orce nd ink. Ech edge (, ) E h nonnegie

More information

Properties of Logarithms. Solving Exponential and Logarithmic Equations. Properties of Logarithms. Properties of Logarithms. ( x)

Properties of Logarithms. Solving Exponential and Logarithmic Equations. Properties of Logarithms. Properties of Logarithms. ( x) Properies of Logrihms Solving Eponenil nd Logrihmic Equions Properies of Logrihms Produc Rule ( ) log mn = log m + log n ( ) log = log + log Properies of Logrihms Quoien Rule log m = logm logn n log7 =

More information

On the Pseudo-Spectral Method of Solving Linear Ordinary Differential Equations

On the Pseudo-Spectral Method of Solving Linear Ordinary Differential Equations Journl of Mhemics nd Sisics 5 ():136-14, 9 ISS 1549-3644 9 Science Publicions On he Pseudo-Specrl Mehod of Solving Liner Ordinry Differenil Equions B.S. Ogundre Deprmen of Pure nd Applied Mhemics, Universiy

More information

Physic 231 Lecture 4. Mi it ftd l t. Main points of today s lecture: Example: addition of velocities Trajectories of objects in 2 = =

Physic 231 Lecture 4. Mi it ftd l t. Main points of today s lecture: Example: addition of velocities Trajectories of objects in 2 = = Mi i fd l Phsic 3 Lecure 4 Min poins of od s lecure: Emple: ddiion of elociies Trjecories of objecs in dimensions: dimensions: g 9.8m/s downwrds ( ) g o g g Emple: A foobll pler runs he pern gien in he

More information

Green s Functions and Comparison Theorems for Differential Equations on Measure Chains

Green s Functions and Comparison Theorems for Differential Equations on Measure Chains Green s Funcions nd Comprison Theorems for Differenil Equions on Mesure Chins Lynn Erbe nd Alln Peerson Deprmen of Mhemics nd Sisics, Universiy of Nebrsk-Lincoln Lincoln,NE 68588-0323 lerbe@@mh.unl.edu

More information

A Simple Method to Solve Quartic Equations. Key words: Polynomials, Quartics, Equations of the Fourth Degree INTRODUCTION

Temperature Rise of the Earth

An object moving with speed v around a point at distance r has an angular velocity

The Taiwan stock market does follow a random walk

Two Coupled Oscillators / Normal Modes

Bayesian Inference for Static Traffic Network Flows with Mobile Sensor Data

Solutions for Nonlinear Partial Differential Equations By Tan-Cot Method
