Deep Reinforcement Learning with Experience Replay Based on SARSA

Dongbin Zhao, Haitao Wang, Kun Shao and Yuanheng Zhu
Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
dongbin.zhao@ia.ac.cn, wanghaitao8118@163.com, shaokun2014@ia.ac.cn, yuanheng.zhu@ia.ac.cn

Abstract—SARSA, as one kind of on-policy reinforcement learning method, is integrated with deep learning to solve the video game control problem in this paper. We use a deep convolutional neural network to estimate the state-action value, and SARSA learning to update it. Besides, experience replay is introduced to make the training process suitable for scalable machine learning problems. In this way, a new deep reinforcement learning method, called deep SARSA, is proposed to solve complicated control problems such as imitating human players in video games. From the experimental results, we can conclude that deep SARSA learning shows better performance in some aspects than deep Q learning.

Keywords—SARSA learning; Q learning; experience replay; deep reinforcement learning; deep learning

I. INTRODUCTION

With the development of artificial intelligence (AI), more and more intelligent devices come into use in our daily lives. In the face of unknown complicated environments, these intelligent devices should know how to perceive the environment and make decisions accordingly. In the last 20 years, deep learning (DL) [1] based on neural networks has greatly promoted the development of high-dimensional information perception. With its powerful generalizing ability, DL can retrieve highly abstract structures or features from the real environment and then precisely depict the complicated dependence among raw data such as images and videos. With this excellent ability of feature detection, DL has been applied to many learning tasks, such as handwritten digit recognition [2], scenario analysis [3] and so on. Although it has achieved great breakthroughs in information perception, especially in image classification problems, DL has its natural drawbacks.
It cannot directly select policies or deal with decision-making problems, resulting in limited application in the intelligent control field.

Different from DL, reinforcement learning (RL) is a class of methods which try to find optimal or near-optimal policies for complicated systems or agents [4-6]. As an effective decision-making method, it has been introduced into optimal control [7], model-free control [8, 9] and so on. Generally, RL has two classes: policy iteration and value iteration. On the other hand, it can also be divided into off-policy and on-policy methods. Common RL methods include Q learning, SARSA learning, TD(λ) and so on [4]. Though RL is naturally designed to deal with decision-making problems, it has run into great difficulties when handling high-dimensional data. With the development of feature detection methods like DL, such problems are expected to be well solved. A new class of methods, called deep reinforcement learning (DRL), has emerged to lead the direction of advanced AI research.

DRL combines the excellent perceiving ability of DL with the decision-making ability of RL. In 2010, Lange [10] proposed a typical algorithm which applied a deep auto-encoder neural network (DANN) to a visual control task. Later, Abtahi and Fasel [11] employed a deep belief network (DBN) as the function approximator to improve the learning efficiency of the traditional neural fitted-Q method. Then, Arel [12] gave a complete definition of deep reinforcement learning in 2012. Most importantly, the group at DeepMind introduced the deep Q network (DQN) [13, 14], which utilizes a convolutional neural network (CNN) instead of the traditional Q network. Their method has been applied to the video game platform called the Arcade Learning Environment (ALE) [15] and can even obtain higher scores than human players in some games, like Breakout. Based on their work, Levine [16] applied recurrent neural networks to the framework proposed by DeepMind. In 2015, DeepMind put forward a new DRL framework based on Monte Carlo tree search (MCTS) to train a Go agent called AlphaGo [17], which beat one of the most excellent human players, Lee Sedol, in 2016. This match raised people's interest in DRL, which is leading the trend of AI.
Though the DQN algorithm shows excellent performance in video games, it also has drawbacks, such as an inefficient data sampling process and the defects of off-policy RL methods. In this paper, we focus on a brand new DRL method based on SARSA learning, also called deep SARSA, for imitating human players in video games. The deep SARSA method integrated with an experience replay process is proposed. To the best of our knowledge, this is the first attempt to combine SARSA learning with DL for complicated systems.

The paper is organized as follows. In Section II, SARSA learning and ALE are introduced as the background and preliminaries. Then a new deep SARSA method is proposed in Section III to solve complicated control tasks such as video games. Two simulation results are given to validate the effectiveness of the proposed deep SARSA in Section IV. In the end we draw a conclusion.

This work was supported in part by the National Natural Science Foundation of China (Nos. 61273136, 61573353, 61533017 and 61603382).
II. SARSA LEARNING AND ARCADE LEARNING ENVIRONMENT

A. Q learning and SARSA learning

Considering a Markov decision process (MDP), the goal of the learning task is to maximize the future reward when the agent interacts with the environment. Generally, we define the future reward from the time step t as

    R(t) = \sum_{k=0}^{T} \gamma^k r_{t+k+1},    (1)

where r_{t+k+1} is the reward when an action is taken at time t+k, and T is often regarded as the time when the process terminates. In addition, \gamma \in (0,1] is the discount factor; note that \gamma = 1 and T = \infty cannot be satisfied simultaneously. Then the state-action value function Q^\pi(s,a), which indicates the expected return when the agent takes the action a at the state s under the policy \pi, can be defined as

    Q^\pi(s,a) = E_\pi\{R(t) \mid s_t = s, a_t = a\},    (2)

where E_\pi\{R(t) \mid s_t = s, a_t = a\} is the expected return and \pi is the policy function over actions. Now the learning task aims at obtaining the optimal state-action value function Q^*(s,a), which is usually relevant to the Bellman equation. Then two methods called Q learning and SARSA learning will be compared for getting the optimal state-action value function.

As one of the traditional RL algorithms, Q learning is an off-policy method. The agent independently interacts with the environment, which often means selecting the action a. Then the reward r is fed back from the environment and the next state s' is derived. Here Q(s,a) represents the current state-action value. In order to update the current state-action value function, we employ the next state-action value to estimate it, although the next action a' is still unknown. So the most important principle in Q learning is to take a greedy action to maximize the next Q(s',a'). The update equation is given as

    Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)],    (3)

where \alpha represents the learning rate. By contrast, SARSA learning is an on-policy method. This means that when updating the current state-action value, the next action a' will actually be taken, while in Q learning the bootstrap action is completely greedy. Given such analysis, the update equation of the state-action value can be defined as

    Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma Q(s',a') - Q(s,a)].    (4)

Actually, the difference between SARSA learning and Q learning lies in the update equations (3) and (4).
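The two update rules (3) and (4) differ only in the bootstrap term. As a minimal illustration, the following tabular sketch applies both updates to a dictionary-backed Q table; the state labels, action set and hyperparameter values are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
    # Eq. (3): off-policy -- bootstrap from the greedy action in s'.
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    # Eq. (4): on-policy -- bootstrap from the action a' actually taken.
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Toy values: state s1 has Q(s1,0)=1 and Q(s1,1)=2.
Q = defaultdict(float)
Q[('s1', 0)], Q[('s1', 1)] = 1.0, 2.0
q_learning_update(Q, 's0', 0, 1.0, 's1', actions=[0, 1])
sarsa_update(Q, 's0', 1, 1.0, 's1', a2=0)
```

With these values, the Q-learning target is 1 + 0.99·2 = 2.98 (it always takes the maximum), while the SARSA target uses the action a' = 0 that was actually taken and is 1 + 0.99·1 = 1.99; this single term is the only difference between the two methods.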
In SARSA learning, the training data form a quintuple (s, a, r, s', a'). In every update process, this quintuple is derived in sequence. However, in Q learning a' is just for estimation and will not actually be taken.

B. Arcade Learning Environment

The Arcade Learning Environment (ALE) is a wrapper, or platform, including many video games for the Atari 2600. As a benchmark for new advanced RL algorithms, it presents interfaces for the agent, including the states and the rewards [15]. The states are high-dimensional visual inputs (210×160 RGB video at 60 Hz), the same as what a human player receives, and the rewards can be transformed from the scores given by the environment when the agent interacts with the platform. Usually, the agent interacts with ALE through a set of 18 actions, only 5 of which are basic actions. These 5 actions contain 4 movement directions, and the remaining one is either firing or null. To be clear, the reward or score comes from the system output instead of being recognized from the image.

ALE is designed to be remarkably suitable for testing RL algorithms, so many groups have applied their algorithms to this platform, including the DQN method proposed by DeepMind. Though DQN has achieved excellent performance in video games, it only combines basic Q learning with deep learning. Many other reinforcement learning methods, like on-policy methods, can help improve the performance of deep reinforcement learning. In the next section, we present a DRL method based on SARSA learning to improve the training process in video games from ALE.

III. DEEP REINFORCEMENT LEARNING METHOD BASED ON SARSA LEARNING

Before deep reinforcement learning algorithms came out, many traditional RL methods had been applied to ALE. Defazio and Graepel [18] applied some RL methods to those complicated video games. They compared the advantages and disadvantages of different RL methods such as Q learning, SARSA learning, actor-critic, GQ, R learning and so on. The results are listed in Table I.
TABLE I. THE PERFORMANCE OF DIFFERENT RL METHODS IN ALE [18]

    Method                  SARSA   AC      GQ      Q       R
    Relative performance    1.00    0.99    0.65    0.82    0.96

From Table I, the average performance of Q learning is only 82% of that of SARSA learning in video games. Though these algorithms only use hand-crafted features, the results above indicate that SARSA learning can achieve better performance than Q learning. So based on these facts, a new deep reinforcement learning method based on SARSA learning is proposed as follows.
A. SARSA network

Games from the Atari 2600 can be regarded as MDPs to be solved by RL algorithms. Here SARSA learning is integrated into the DRL framework. Similar to DQN in [14], given the current state s, the action a is selected by the ε-greedy method. Then the next state s' and the reward r are observed. Q(s,a) is the current state-action value. So in DRL based on SARSA, the optimal state-action value can be estimated by

    Q^*(s,a) = E[r + \gamma Q(s',a') \mid s, a],    (5)

where a' is the next action selected by ε-greedy. Similarly, in deep SARSA learning, the value function is still approximated with a convolutional neural network (CNN), whose structure is shown in Fig. 1. The input of the CNN is the raw image from the video game and the output is the Q value of every action. θ is defined as the parameter vector of the CNN. At the i-th iteration of training, the loss function of the network can be defined as

    L_i(\theta_i) = (y_i - Q(s,a;\theta_i))^2,    (6)

where y_i = r + \gamma Q(s',a';\theta_{i-1}). Then the main objective is to optimize the loss function L_i(\theta_i). From the view of supervised learning, y_i is regarded as the label in training, though it is also a variable. By differentiating (6), we get the gradient of the loss function

    \nabla_{\theta_i} L_i(\theta_i) = -2 (r + \gamma Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)) \nabla_{\theta_i} Q(s,a;\theta_i),    (7)

where \nabla_{\theta_i} Q(s,a;\theta_i) is the gradient of the current state-action value. Then according to (7), we can optimize the loss function by stochastic gradient descent (SGD), Adadelta and so on [19]. Besides, the reinforcement learning process should also be taken into consideration. The last layer of the network outputs the Q value of each action, so we can select the action and update it by the SARSA method. Fig. 2 depicts the forward data flow of the SARSA network in the training process.

Fig. 1 The convolutional neural network in DRL (input, alternating convolution and pooling layers, and the target output)

Fig. 2 The forward data flow in DRL

The feature extraction in Fig. 2 can be seen as the image preprocessing plus the CNN. After the state-action value is obtained, a proper action is selected by the SARSA learning method to make decisions. Later we introduce the experience replay technique [20] to improve the training process of DRL and adapt reinforcement learning to a scalable machine learning process.
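The ε-greedy selection and the descent step of eqs. (6)-(7) can be sketched as follows. For brevity a linear approximator Q(s,a;θ) = θ_a·φ(s) stands in for the CNN (an assumption of this sketch), and a single θ is used where the paper uses θ_{i-1} for the target; the structure of the update is otherwise the same.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.05):
    # With a tiny probability explore; otherwise act greedily.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def sarsa_sgd_step(theta, phi_s, a, r, phi_s2, a2, gamma=0.99, lr=0.01):
    """One descent step on the loss (y - Q(s,a;theta))^2 of eqs. (6)-(7).

    theta: (n_actions, d) weights of a linear stand-in for the CNN;
    phi_s, phi_s2: (d,) feature vectors of the states s and s'.
    """
    y = r + gamma * theta[a2] @ phi_s2   # SARSA target, uses the taken action a'
    td_error = y - theta[a] @ phi_s      # y is treated as a fixed label
    theta = theta.copy()
    theta[a] += lr * td_error * phi_s    # gradient step; the factor 2 folds into lr
    return theta, td_error
```

Only the row of θ belonging to the selected action a moves, mirroring the fact that the label vector and the network output differ in a single component.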
B. Experience replay

In traditional reinforcement learning methods, learning and updating proceed in sequence. That is to say, every sample stimulates only one update, making the learning process rather slow. In order to adapt to the scalable machine learning process, the historical data are stored in memory and retrained later continuously. In Fig. 2, a quadruple (s,a,r,s') is kept in the historical data D = {e_1, ..., e_N}, where N indicates the size of the historical stack. Then in the training process, we sample the training data from this stack D. There are several methods by which samples can be obtained, such as consecutive sampling, uniform sampling, and weighted sampling by reward. Here we follow the uniform sampling method of DQN, which has two advantages. Firstly, the efficiency of data usage is improved. Secondly, consecutive samples might be greatly correlated with each other; uniform sampling (s,a,r,s') ~ U(D) can reduce the correlation between input data [14].

Before raw images from the video game are sampled, some preprocessing must be done. We can obtain every frame from the video; however, it would be less efficient if one frame were regarded as the state. Additionally, consecutive frames might contain important features of the image, such as speed or geometrical relationships, which can contribute more to the performance of the agent. If a single frame were trained on, all those vital features would be abandoned. So, in this paper one action is taken every 4 frames, the same as [14]. The 4 frames are concatenated as the state; the concatenation is defined as the function φ. After being processed, the states are stored in the stack D. The next section introduces the whole process of DRL based on SARSA learning.

C. Deep SARSA learning

Given the number of actions n of the video game, the SARSA network should contain n outputs, which represent the n discrete state-action values, to interact with ALE. The current state s is processed by the CNN to get the current state-action value vector Q_1, which is an n-dimensional vector. Then the current action a is selected with the ε-greedy algorithm, and the reward r and the next state s' are observed. In order to estimate the current Q(s,a), the next state-action value Q(s',a') is obtained according to (4).
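The frame concatenation φ described in subsection B can be sketched as a small rolling buffer. The 84×84 frame size follows the DQN preprocessing of [14] and is an assumption here; the paper itself only specifies that 4 frames are stacked.

```python
import numpy as np
from collections import deque

class FrameStacker:
    """Maintains the last n preprocessed frames and produces the state phi."""
    def __init__(self, n=4, shape=(84, 84)):
        # Start from blank frames so phi is well defined at episode start.
        self.buf = deque([np.zeros(shape, dtype=np.float32)] * n, maxlen=n)

    def push(self, frame):
        # Append the newest frame; the oldest one is evicted automatically.
        self.buf.append(frame.astype(np.float32))
        return self.state()

    def state(self):
        # phi: concatenation of the last n frames, shape (n, H, W).
        return np.stack(self.buf, axis=0)
```

The resulting (4, 84, 84) tensor is what would be stored in the stack D and fed to the CNN, so that motion cues such as speed survive the preprocessing.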
Here, when the next state s' is input into the CNN, the next state-action value vector Q_2 can be obtained. Then we define a label vector related to Q_1, which represents the target vector. The two vectors have only one different component: r + γQ(s',a') replaces Q(s,a). Now the whole scheme of DRL based on SARSA learning is presented in Algorithm 1. It should be noted that during training, the next action a' for estimating the current state-action value is never purely greedy; on the contrary, there is a tiny probability that a random action is chosen.

Algorithm 1 Deep Reinforcement Learning based on SARSA
1: initialize the data stack D with size N and the parameters θ of the CNN
2: for episode = 1, M do
3:   initialize the state s_1 = {x_1} and preprocess the state φ_1 = φ(s_1)
4:   select a_1 with the ε-greedy method
5:   for t = 1, T do
6:     take action a_t, observe the next state x_{t+1} and the reward r_t
7:     store the data (φ_t, a_t, r_t, φ_{t+1}) into the stack D
8:     sample data (φ_j, a_j, r_j, φ_{j+1}) from the stack D and select a' with the ε-greedy method
9:     y_j = r_j, if the episode terminates at step j+1;
       y_j = r_j + γ Q(φ_{j+1}, a'; θ), otherwise
10:    according to (7), optimize the loss function L(θ)
11:  end for
12: end for

IV. EXPERIMENTS AND RESULTS

In this section, two simulation experiments are presented to verify our algorithm. The two video games, Breakout and Seaquest, are from the Atari 2600. Fig. 3 shows images of the two games. The CNN contains 3 convolution layers and 2 fully connected layers. All the settings in these two experiments are the same as DQN [14], except for the RL method. The discount factor is γ = 0.99. Every 250 thousand steps, the agent is tested, and every testing episode is 125 thousand steps.

Fig. 3 Two video games: Breakout and Seaquest.

A. Breakout

In Breakout, 5 basic actions including up, down, left, right and null are given. The operation image is shown on the left of Fig. 3. This game expects the agent to obtain as high a score as possible. The agent controls a paddle which can reflect the ball. Once the ball hits a brick in the top area, the agent gets 1 point. Every time the ball falls down, the number of lives is reduced by 1 until the game is over.

Fig. 4 and Fig. 5 present the average scores with deep SARSA learning and deep Q learning. We can see that at the end of the 20th epoch, deep SARSA learning reaches an average reward of about 100; by contrast, deep Q learning reaches about 170. We can conclude that in the early stage of training, deep SARSA learning converges more slowly than deep Q learning. However, after 30 epochs, deep SARSA learning gains higher average scores. In addition, deep SARSA learning converges more stably than deep Q learning.

Fig. 4 Average scores with deep SARSA learning in Breakout

Fig. 5 Average scores with deep Q learning in Breakout
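For concreteness, Algorithm 1 can be condensed into the following self-contained sketch. A tabular Q dictionary stands in for the CNN, `step_fn` stands in for ALE, and all hyperparameter values are illustrative assumptions, not the experimental settings above.

```python
import random
from collections import defaultdict, deque

def deep_sarsa_train(step_fn, n_actions, episodes=50, horizon=20,
                     capacity=1000, batch=8, alpha=0.1, gamma=0.99, eps=0.1):
    """Algorithm 1 with a tabular Q standing in for Q(s,a; theta).

    step_fn(s, a) -> (next_state, reward, done) plays the role of ALE.
    """
    Q = defaultdict(float)          # stand-in for the CNN value function
    D = deque(maxlen=capacity)      # replay stack D, oldest entries evicted

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda b: Q[(s, b)])

    for _ in range(episodes):
        s = 0                       # initial state of the toy environment
        a = eps_greedy(s)
        for _ in range(horizon):
            s2, r, done = step_fn(s, a)
            D.append((s, a, r, s2, done))
            # Lines 8-10: sample a minibatch and apply the SARSA update (4).
            for (sj, aj, rj, sj2, dj) in random.sample(list(D), min(batch, len(D))):
                a2 = eps_greedy(sj2)   # next action: epsilon-greedy, never purely greedy
                y = rj if dj else rj + gamma * Q[(sj2, a2)]
                Q[(sj, aj)] += alpha * (y - Q[(sj, aj)])
            if done:
                break
            s, a = s2, eps_greedy(s2)
    return Q

# Toy stand-in for a game: action 1 moves right and pays 1; state 3 ends the episode.
def toy_step(s, a):
    if a == 1:
        return min(s + 1, 3), 1.0, s + 1 >= 3
    return s, 0.0, False

random.seed(0)
Q = deep_sarsa_train(toy_step, n_actions=2)
```

On this toy chain the learned table favors the rewarded action, which is enough to see the on-policy target and the uniform replay sampling interact as in Algorithm 1.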
The number of games during testing with the two algorithms is displayed in Fig. 6 and Fig. 7. It reflects the convergence trend of these algorithms. After training for 20 epochs, deep SARSA learning also converges to an equilibrium point at about 75. In deep Q learning, the equilibrium point is about 80.

Fig. 6 Number of games during testing with deep SARSA learning in Breakout

Fig. 7 Number of games during testing with deep Q learning in Breakout

B. Seaquest

In Seaquest, 5 basic actions are given, including up, down, left, right and firing. The operation image is shown on the right of Fig. 3. This game expects the agent to obtain as high a score as possible by saving divers and killing fish. The agent controls the submarine with the five basic actions mentioned above. Once the submarine saves a diver or kills a fish, the agent gets 20 or 40 points respectively. If the submarine runs into a fish or the oxygen in the submarine drops to 0, the number of lives drops by 1 until the game is over. So a human playing this game would also have to take the quantity of oxygen into consideration.

Fig. 8 and Fig. 9 show the average scores of deep SARSA learning and deep Q learning. We can see that the scores of deep SARSA learning increase a little more slowly before the 10th epoch than deep Q learning. However, it converges much faster after the 30th epoch. At last deep SARSA learning gains about 5000 points while deep Q learning only gets 3700 points. The number of games during testing with the two algorithms is shown in Fig. 10 and Fig. 11. It also reflects the trend of the DRL process. Deep SARSA learning even shows a smoother process in this video game than deep Q learning.

Fig. 8 Average scores with deep SARSA learning in Seaquest

Fig. 9 Average scores with deep Q learning in Seaquest

Fig. 10 Number of games during testing with deep SARSA learning in Seaquest
Fig. 11 Number of games during testing with deep Q learning in Seaquest

V. CONCLUSION

In this paper, we introduce an on-policy method, SARSA learning, to DRL. SARSA learning has some advantages when applied to decision-making problems: it makes the learning process more stable and more suitable for some complicated systems. Given these facts, a new DRL algorithm based on SARSA, called deep SARSA learning, is proposed to solve the control problem of video games. Two simulation experiments are given to compare the performance of deep SARSA learning and deep Q learning. In Section IV, the results reveal that deep SARSA learning gains higher scores and faster convergence in Breakout and Seaquest than deep Q learning.

REFERENCES
[1] LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. Nature, 2015, 521(7553): p. 436-444.
[2] LeCun, Y., L. Bottou, and Y. Bengio, Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998, 86(11): p. 2278-2324.
[3] Farabet, C., C. Couprie, L. Najman, and Y. LeCun, Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint arXiv:1202.2160, 2012.
[4] Sutton, R.S. and A.G. Barto, Introduction to reinforcement learning. 1998: MIT Press.
[5] Wang, F.Y., H. Zhang, and D. Liu, Adaptive dynamic programming: an introduction. IEEE Computational Intelligence Magazine, 2009, 4(2): p. 39-47.
[6] Zhao, D. and Y. Zhu, MEC--a near-optimal online reinforcement learning algorithm for continuous deterministic systems. IEEE Transactions on Neural Networks and Learning Systems, 2015, 26(2): p. 346-356.
[7] Zhu, Y., D. Zhao, and X. Li, Using reinforcement learning techniques to solve continuous-time non-linear optimal tracking problems without system dynamics. IET Control Theory & Applications, 2016, 10(12): p. 1339-1347.
[8] Zhu, Y. and D. Zhao, A data-based online reinforcement learning algorithm satisfying probably approximately correct principle. Neural Computing and Applications, 2015, 26(4): p. 775-787.
[9] Xia, Z. and D. Zhao, Online Bayesian reinforcement learning by Gaussian processes. IET Control Theory & Applications, 2016, 10(12): p. 1331-1338.
[10] Lange, S. and M. Riedmiller,
Deep auto-encoder neural networks in reinforcement learning. In The 2010 International Joint Conference on Neural Networks (IJCNN). 2010.
[11] Abtahi, F. and I. Fasel, Deep belief nets as function approximators for reinforcement learning. In Proceedings of IEEE ICDL-EPIROB. 2011.
[12] Arel, I., Deep reinforcement learning as a foundation for artificial general intelligence. In Theoretical Foundations of Artificial General Intelligence. 2012, Springer. p. 89-102.
[13] Mnih, V., K. Kavukcuoglu, D. Silver, et al., Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[14] Mnih, V., K. Kavukcuoglu, D. Silver, et al., Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): p. 529-533.
[15] Bellemare, M.G., Y. Naddaf, J. Veness, et al., The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 2012, 47: p. 253-279.
[16] Levine, S., Exploring deep and recurrent architectures for optimal control. arXiv preprint arXiv:1311.1761, 2013.
[17] Silver, D., A. Huang, C.J. Maddison, et al., Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529(7587): p. 484-489.
[18] Defazio, A. and T. Graepel, A comparison of learning algorithms on the Arcade Learning Environment. arXiv preprint arXiv:1410.8620, 2014.
[19] Zeiler, M.D., ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[20] Lin, L.J., Reinforcement learning for robots using neural networks. 1993, Technical report: DTIC Document.