Multiobjective Reinforcement Learning: A Comprehensive Overview

Chunming Liu, Xin Xu, Senior Member, IEEE, and Dewen Hu, Senior Member, IEEE

Abstract: Reinforcement learning (RL) is a powerful paradigm for sequential decision-making under uncertainties, and most RL algorithms aim to maximize some numerical value which represents only one long-term objective. However, multiple long-term objectives are exhibited in many real-world decision and control systems, so recently there has been growing interest in solving multiobjective reinforcement learning (MORL) problems where there are multiple conflicting objectives. The aim of this paper is to present a comprehensive overview of MORL. The basic architecture, research topics, and naïve solutions of MORL are introduced at first. Then, several representative MORL approaches and some important directions of recent research are comprehensively reviewed. The relationships between MORL and other related research are also discussed, including multiobjective optimization, hierarchical RL, and multiagent RL. Moreover, research challenges and open problems of MORL techniques are suggested.

Index Terms: Markov decision process (MDP), multiobjective reinforcement learning (MORL), Pareto front, reinforcement learning (RL), sequential decision-making.

Manuscript received February 16, 2014; revised June 11, 2014; accepted August 15, 2014. Date of publication October 8, 2014; date of current version February 12, 2015. This work was supported in part by the Program for New Century Excellent Talents in Universities under Grant NCET and in part by the National Fundamental Research Program of China under Grant 2013CB. This paper was recommended by Associate Editor A. H. Tan. C. Liu is with the College of Mechatronics and Automation, National University of Defense Technology, Changsha, China. X. Xu is with the Institute of Unmanned Systems, College of Mechatronics and Automation, National University of Defense Technology, Changsha, China (e-mail: xinxu@nudt.edu.cn). D. Hu is with the Department of Automatic Control, College of Mechatronics and Automation, National University of Defense Technology, Changsha, China (e-mail: dwhu@nudt.edu.cn).

I. INTRODUCTION

Reinforcement learning (RL) was originally studied from the perspective of animal learning behaviors [1], and it has become a major class of machine learning methods [2] for solving sequential decision-making problems under uncertainties [3], [4]. In an RL system, a learning agent aims to learn an optimal action policy via interactions with an uncertain environment. At each step, the learning agent is not told explicitly what action to take; instead, it must determine the best action to maximize long-term rewards and execute it. The selected action then causes the current state of the environment to transition to its successive state, and the agent receives a scalar reward signal that evaluates the effect of this state transition, as shown in Fig. 1. Thus, there is a feedback architecture in a learning system based on RL, and the interaction between the learning agent and its environment can be described by a sequence of states, actions, and rewards. This sequential decision process is usually modeled as a Markov decision process (MDP). The rule or strategy for action selection is called a policy. In RL, the agent learns optimal or near-optimal action policies from such interactions in order to maximize some notion of long-term objectives. In the past decades, there has been a large number of works on RL theory and algorithms [5]–[8].

Fig. 1. Basic RL scenario.
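As a minimal illustration of the interaction loop in Fig. 1, the following Python sketch runs one episode of agent-environment interaction. The env and policy objects and their reset/step interfaces are hypothetical placeholders introduced only for this example; they are not part of the original paper.

def run_episode(env, policy, max_steps=1000):
    # One pass through the basic RL loop of Fig. 1 (hypothetical env/policy interfaces).
    state = env.reset()                      # observe the initial state
    trajectory = []                          # sequence of (state, action, reward)
    for _ in range(max_steps):
        action = policy(state)               # agent selects an action
        next_state, reward, done = env.step(action)   # state transition + scalar reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory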
By focusing the computational effort along state transition trajectories and using function approximation techniques for estimating value functions or policies, RL algorithms have produced good results in some challenging real-world problems [9], [10]. However, despite many advances in RL theory and algorithms, one remaining challenge is to scale up to larger and more complex problems. The scaling problem for sequential decision-making mainly includes the following aspects [11]: a problem that has a very large or continuous state or action space; a problem that is best described as a set of hierarchically organized tasks and sub-tasks; and a problem that needs to solve several tasks with different rewards simultaneously. An RL problem of the last kind is called a multiobjective RL (MORL) problem, which refers to a sequential decision-making problem with multiple objectives.

MORL has been regarded as an important research topic due to the multiobjective characteristics of many practical sequential decision-making and adaptive optimal control problems in the real world. Compared with conventional RL problems, MORL problems require a learning agent to obtain action policies that can optimize two or more objectives at the same time. In MORL, each objective has its own associated reward signal, so the reward is not a scalar value but a vector. When all the objectives are directly related, a single objective can be derived by combining the multiple objectives together. If all the objectives are completely unrelated, they can be optimized separately and we can find a combined policy to optimize all of them. However, if there are conflicting objectives, any policy can only maximize one of the objectives, or realize a trade-off among the conflicting objectives [12]. Therefore, MORL can be viewed as the combination of multiobjective optimization (MOO) and RL techniques to solve sequential decision-making problems with multiple conflicting objectives.
In the MOO domain, there are two common strategies [13]: one is the multiobjective-to-single-objective strategy and the other is the Pareto strategy. The former strategy optimizes a scalar value, and includes the weighted sum method [14], the constraint method [15], the sequential method [16], and the max-min method [17], among others. In these methods, a scalar value is computed from the multiple objectives as the utility of an action decision, and conventional single-objective optimization (SOO) techniques can be used. The latter strategy uses vector-valued utilities. In this case, it is difficult to order the candidate solutions completely, and the Pareto optimality concept is usually used. The Pareto optimal solutions are defined as noninferior, alternative solutions among the candidate solutions, and they represent the optimal solutions for some possible trade-offs among the multiple conflicting objectives [12]. All Pareto optimal solutions constitute the Pareto front, and one major research issue of MOO is to find or approximate the Pareto front.

Similar to the MOO domain, MORL algorithms can be divided into two classes based on the number of learned policies. One class comprises single-policy MORL approaches and the other, multiple-policy MORL approaches [12], [18]. Single-policy approaches aim to find the best single policy which represents the preferences among the multiple objectives as specified by a user or derived from the problem domain. The major difference among single-policy approaches is the way of determining and expressing these preferences. The aim of multiple-policy MORL approaches is to find a set of policies that approximate the Pareto front. The main difference among multiple-policy approaches is the approximation scheme for the Pareto front. Major approaches to MORL are further discussed in Section IV.

Although there have been some recent advances in different MORL algorithms, many research challenges still remain in developing MORL theory and algorithms for real-world problems. In addition, to the authors' knowledge, only one related survey has been published in the literature, and it covers the much broader topic of multiobjective sequential decision-making [18]. Therefore, the aim of this paper is to provide a comprehensive review of MORL principles, algorithms, and some open problems. A representative set of MORL approaches are selected to show the overall framework of the field, to present a summary of major achievements, and to suggest some open problems for future research.

The remainder of this paper is organized as follows. In Section II, the background of MORL is briefly introduced, including MDPs, RL, and MOO. The basic architecture, research topics, and naïve solutions of MORL are described in Section III. A representative set of approaches to MORL are reviewed in Section IV. In Section V, some important directions of recent research on MORL are discussed in detail. Related fields of study are introduced in Section VI. Section VII analyzes the challenges and open problems of MORL. Section VIII concludes this comprehensive overview.

II. BACKGROUND

In this section, the necessary background on MDP models, RL techniques, and MOO problems is introduced. First, the MDP is characterized as the formulation of a sequential decision-making problem.
Then, some basic RL techniques are introduced, where the discussion is restricted to finite state and action spaces, since most MORL results up to now are given for finite spaces. Finally, the MOO problem is introduced, as well as the concept of Pareto optimality.

A. MDP Models

A sequential decision-making problem can be formulated as an MDP, which is defined as a 4-tuple {S, A, R, P}. In this 4-tuple, S is the state space of a finite set of states, A is the action space of a finite set of actions, R is the reward function, and P is the matrix of state transition probabilities. For a state transition from state s to state s' when taking action a, p(s, a, s') and r(s, a, s') represent the probability and the reward of the state transition, respectively. An action policy of the MDP is defined as a function π : S → Pr(A), where Pr(A) is a probability distribution over A.

Due to the different influences of future rewards on the present value, there are two different objective functions of an MDP. One is the discounted reward criterion, which is to estimate the optimal policy π* satisfying

J^{\pi^*} = \max_{\pi} J^{\pi} = \max_{\pi} E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]   (1)

where γ (0 < γ < 1) is the discount factor, r_t = r(x_t, a_t) is the reward at time step t, E_π[·] stands for the expectation with respect to the policy π and the probability matrix P, and J^π is the expected total reward. The other is called the average reward criterion, which is to estimate the optimal policy π* satisfying

\rho^{\pi^*} = \max_{\pi} \rho^{\pi} = \max_{\pi} \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} E_{\pi}[r_{t}]   (2)

where ρ^π is the average reward per time step for the policy π.

For the discounted reward criterion, the state value function and the state-action value function for a policy π are defined by

V^{\pi}(s) = E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s \right]   (3)

Q^{\pi}(s, a) = E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s, a_0 = a \right].   (4)

According to the theory of dynamic programming (DP) [20], the following Bellman equations are satisfied:

V^{\pi}(s) = E_{\pi}\left[ r(s, a) + \gamma V^{\pi}(s') \right]   (5)

Q^{\pi}(s, a) = R(s, a) + \gamma E_{\pi}\left[ Q^{\pi}(s', a') \right]   (6)

where R(s, a) is the expected reward received after taking action a in state s, s' is the successive state of s, and π(s, a) represents the probability of action a being taken by policy π in state s.
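The Bellman equation (5) can be turned directly into an iterative policy evaluation procedure. The following Python sketch assumes a tabular MDP stored as NumPy arrays (P for transition probabilities, R for expected rewards, pi for a stochastic policy); this array layout is an illustrative assumption, not notation from the paper.

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.95, tol=1e-8):
    # P[s, a, s']: transition probabilities, R[s, a]: expected rewards,
    # pi[s, a]: action probabilities of the policy to be evaluated.
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)              # backup of every state-action pair
        V_new = (pi * Q).sum(axis=1)         # V(s) = sum_a pi(s,a) Q(s,a), i.e., Eq. (5)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new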

The optimal state-action value function is defined as

Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a).

When Q*(s, a) is obtained, the optimal policy π* can be computed by

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)

where the optimal policy π* is a deterministic policy, and it is a mapping from S to A.

For the average reward criterion, let ρ^π [see (2)] be the average reward per time step for policy π. The relative state value function and the relative state-action value function are defined as [4]

V^{\pi}(s) = \sum_{t=0}^{\infty} E_{\pi}\left[ r_t - \rho^{\pi} \mid s_0 = s \right]

Q^{\pi}(s, a) = \sum_{t=0}^{\infty} E_{\pi}\left[ r_t - \rho^{\pi} \mid s_0 = s, a_0 = a \right]

and the following Bellman equations are satisfied:

V^{\pi}(s) = E_{\pi}\left[ r_t + V^{\pi}(s') \right] - \rho^{\pi}

Q^{\pi}(s, a) = R(s, a) - \rho^{\pi} + E_{\pi}\left[ Q^{\pi}(s', a') \right].   (7)

The optimal relative state-action value function for the average reward setting satisfies

Q^{*}(s, a) + \rho^{\pi^{*}} = \max_{\pi} \left\{ Q^{\pi}(s, a) + \rho^{\pi} \right\}

and the optimal policy π* can also be obtained by

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).

B. Basic RL Algorithms

The earlier approach to solving MDPs with model information is to use dynamic programming (DP) techniques, which compute the optimal policies by estimating the optimal state-action value functions. However, traditional DP algorithms commonly require full model information, and large amounts of computation are needed for large state and action spaces. Different from DP, an RL agent learns an optimal or near-optimal policy by interacting with the environment, whose dynamic model is assumed to be unknown [4], [19]. As indicated in [4], based on the observed state transition data of MDPs, RL algorithms integrate the techniques of Monte Carlo estimation, stochastic approximation, and function approximation to obtain approximate solutions of MDPs.

As a central mechanism of RL, temporal-difference (TD) learning [5] can be viewed as a combination of Monte Carlo and DP. On one hand, like Monte Carlo methods, TD algorithms can learn the value functions from state transition data without model information. On the other hand, similar to DP, TD methods can update the current estimate of the value functions partially based on previously learned results, before a final outcome is obtained [4], [5].

For the discounted reward criterion, Q-learning and Sarsa are the most widely used tabular RL algorithms. The Q-learning algorithm is shown in Algorithm 1, where α is the learning rate parameter and r is the immediate reward.

Algorithm 1 Q-Learning Algorithm [4], [6]
// N: the maximum number of episodes
1: Initialize Q(s, a) arbitrarily;
2: repeat (for each episode i)
3:   Initialize s;
4:   repeat (for each step of the episode)
5:     Choose a from s using a policy derived from Q(s, a);
6:     Take action a, observe r, s';
7:     Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)];
8:     s ← s';
9:   until s is terminal
10: until i = N

Algorithm 2 R-Learning Algorithm [4], [22]
// ρ: the average reward
// N: the maximum number of episodes
1: Initialize Q(s, a) and ρ arbitrarily;
2: repeat (for each episode i)
3:   s ← current state;
4:   Select a from s using a policy derived from Q(s, a);
5:   Take action a, observe r, s';
6:   Q(s, a) ← Q(s, a) + α[r − ρ + max_{a'} Q(s', a') − Q(s, a)];
7:   if Q(s, a) = max_{a} Q(s, a) then
8:     ρ ← ρ + β[r − ρ + max_{a'} Q(s', a') − max_{a} Q(s, a)];
9:   end if
10: until i = N

If, in the limit, the Q values of all admissible state-action pairs are updated infinitely often, and α decays in a way satisfying the usual stochastic approximation conditions, then the Q values will converge to the optimal value Q* with probability 1 [20]. For the Sarsa algorithm, if each action is executed infinitely often in every state that is visited infinitely often, the action is greedy with respect to the current Q value in the limit, and the learning rate decays appropriately, then the estimated Q values will also converge to the optimal value Q* with probability 1 [21].
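For concreteness, a minimal Python version of the tabular Q-learning update in Algorithm 1 is sketched below. The epsilon-greedy exploration scheme and the env.reset()/env.step() interface are assumptions added for the example; they are not part of the original pseudocode.

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy policy derived from Q (line 5 of Algorithm 1)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD update toward r + gamma * max_a' Q(s', a') (line 7 of Algorithm 1)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q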
For the average reward criterion, R-learning [22] is the most widely studied TD-based RL algorithm. The major steps of the R-learning algorithm are illustrated in Algorithm 2.

C. MOO Problems

The MOO problem can be formulated as follows [23], [24]:

\max F(X) = \left[ f_1(X), f_2(X), \ldots, f_{m_f}(X) \right]
\text{s.t. } g_i(X) \ge 0, \quad i = 1, \ldots, m_g

where the max operator for a vector is defined either in the sense of Pareto optimality or in the sense of maximizing a weighted scalar of all the elements, X = [x_1, x_2, \ldots, x_N]^T \in R^N is the vector of variables to be optimized, the functions g_i(X) (i = 1, 2, \ldots, m_g) are the constraint functions of the problem, and f_i(X) (i = 1, 2, \ldots, m_f) are the objective functions.

The optimal solutions of an MOO problem can be described by two concepts. One is the concept of multiobjective to single objective, in which a synthetic objective function is derived, and the optimal solution of the MOO problem can be obtained by solving an SOO problem.
The other is the concept of Pareto dominance and the Pareto front [25]. When a solution A is better than another solution C for at least one objective, and A is also superior or at least equal to C for all the other objectives, the solution A is said to dominate C. In MOO, it is preferable to find all the dominating solutions instead of the dominated ones [25]. In the case of m_f = 2, as shown in Fig. 2(a), solution C is dominated by A and B, and it is hard to compare the overall performance of A and B. As indicated in [25], the Pareto front can be generated by deleting all the dominated solutions from the set of all possible solutions. From Fig. 2(b), it can be seen that the Pareto front is the set of all the black points, while the solutions corresponding to the gray points are dominated by at least one element of the Pareto front. Since it is difficult to obtain the complete Pareto front for an arbitrary real-world MOO problem, a simplified goal for MOO is to find a set of solutions that approximates the real Pareto front [26].

Fig. 2. Concepts of Pareto dominance and Pareto front [25]. (a) Pareto dominance. (b) Pareto front.

III. MORL PROBLEM

Before providing insights into the current state of the art and determining some important directions for future research, it is necessary to characterize the basic architecture, main research topics, and naïve solutions of MORL in advance.

A. Basic Architecture

MORL differs from traditional RL in that there are two or more objectives to be optimized simultaneously by the learning agent, and a reward vector is provided to the learning agent at each step. Fig. 3 shows the basic architecture of MORL, where there are N objectives and r_i (1 ≤ i ≤ N) is the ith feedback signal of the agent's current reward vector provided by the environment. Obviously, this basic architecture illustrates the case of a single agent that has to optimize its action policies for a set of different objectives simultaneously.

Fig. 3. Basic architecture of MORL.

For each objective i (1 ≤ i ≤ N) and stationary policy π, there is a corresponding state-action value function Q_i^π(s, a), which satisfies the Bellman equation (6) or (7). Let

MQ^{\pi}(s, a) = \left[ Q_1^{\pi}(s, a), Q_2^{\pi}(s, a), \ldots, Q_N^{\pi}(s, a) \right]^T

where MQ^π(s, a) is the vectored state-action value function, and it also satisfies the Bellman equation. The optimal vectored state-action value function is defined as

MQ^{*}(s, a) = \max_{\pi} MQ^{\pi}(s, a)   (8)

and the optimal policy π* can also be obtained by

\pi^{*}(s) = \arg\max_{a} MQ^{*}(s, a).   (9)

In this basic architecture, the optimization problems in (8) and (9) are both MOO problems.

B. Major Research Topics

MORL is a highly interdisciplinary field, and it refers to the integration of MOO methods and RL techniques to solve sequential decision-making problems with multiple conflicting objectives. The related disciplines of MORL include artificial intelligence, decision and optimization theory, operations research, control theory, and so on. Research topics of MORL are shaped by both MOO and RL, and mainly include the preferences among different objectives (which may vary with time), appropriate representations of preferences, the approximation of the Pareto front, and the design of efficient algorithms for specific MORL problems. Therefore, one important task of MORL is to suitably represent the designer's preferences or to ensure the optimization priority among some policies in the Pareto front.
After appropriately expressing the preferences, the remaining task is to design efficient MORL algorithms that can solve the sequential decision-making problem based on observed state transition data.

C. Naïve Solutions

Like MOO problems, MORL approaches can be divided into two groups based on the number of policies to be learned [12]: single-policy approaches and multiple-policy approaches. The aim of single-policy approaches is to obtain the best policy which simultaneously satisfies the preferences among the multiple objectives as assigned by a user or defined by the application domain. In this case, the naïve approach to solving an MORL problem is to design a synthetic objective function TQ(s, a) which can suitably represent the overall preferences. Similar to the Q-learning algorithm, a naïve single-policy solution to MORL is shown in Algorithm 3. The Q-value update rule for each objective can be expressed as

Q_i(s, a) = (1 - \alpha) Q_i(s, a) + \alpha \left( r_i + \gamma \max_{a'} Q_i(s', a') \right)

where 1 ≤ i ≤ N and α is the learning rate parameter. The overall single policy can be determined based on TQ(s, a), which can be derived using the Q-values for all the objectives.
Algorithm 3 Naïve Solution of Single-Policy Approaches to MORL
// K: the maximum number of episodes
// N: the number of objectives
1: Initialize TQ(s, a) arbitrarily;
2: repeat (for each episode j)
3:   Initialize s;
4:   repeat (for each step of the episode)
5:     Choose a from s using a policy derived from TQ(s, a);
6:     Take action a, observe r_1, r_2, ..., r_N, s';
7:     for i = 1, 2, ..., N do
8:       Q_i(s, a) ← Q_i(s, a) + α[r_i + γ max_{a'} Q_i(s', a') − Q_i(s, a)];
9:     end for
10:    Compute TQ(s, a);
11:    s ← s';
12:  until s is terminal
13: until j = K

The major difference among single-policy approaches is the way in which these preferences are expressed. By making use of the synthetic objective function, the Q-values of every objective can be utilized to fairly distribute control actions.

In order to ensure diversity in the policy space for different optimization objectives, multiple-policy approaches have been studied to obtain a set of policies that can approximate the Pareto front. The major difference among multiple-policy approaches is the manner in which the Pareto front is approximated. However, it is hard to approximate the Pareto front directly in many real-world applications. One naïve multiple-policy solution is to find policies in the Pareto front by using different synthetic objective functions. Obviously, if a set of parameters can be specified in a synthetic objective function, the optimal policy can be learned for this set of parameters. In [27], it was illustrated that by running the scalar Q-learning algorithm independently for different parameter settings, the MORL problem can be solved in a multiple-policy way.

IV. REPRESENTATIVE APPROACHES TO MORL

According to the different representations of preferences, several typical approaches to MORL have been developed. In this section, seven representative MORL approaches are reviewed and discussed. Among these seven approaches, the weighted sum approach, W-learning, the analytic hierarchy process (AHP) approach, the ranking approach, and the geometric approach are single-policy approaches. The convex hull approach and the varying parameter approach belong to the multiple-policy approaches.

A. Weighted Sum Approach

In [28], an algorithm based on the greatest mass was studied to estimate the combined Q-function. For Q-learning based on the greatest-mass strategy, the synthetic objective function is generated by summing the Q-values for all the objectives

TQ(s, a) = \sum_{i=1}^{N} Q_i(s, a).   (10)

Based on the above synthetic objective function, the action with the maximal summed value is chosen to be executed. Since Sarsa(0) is an on-policy RL algorithm (the samples used for updates are generated from the current action policy) and it does not have the problem of positive bias, GM-Sarsa(0) was proposed for MORL in [11]. The positive bias may be caused by some off-policy RL methods which only use the estimates of greedy actions for learning updates. An advantage of GM-Sarsa(0) is that, since the updates are based on the actually selected actions rather than on the best action determined by the value function, GM-Sarsa(0) is expected to have smaller errors between the estimated Q-values and the true Q-values.

A natural extension of the GM-Sarsa(0) approach is the weighted sum approach, which computes a linearly weighted sum of the Q-values for all the objectives

TQ(s, a) = \sum_{i=1}^{N} w_i Q_i(s, a).

The weights give the user the ability to put more or less emphasis on each objective.
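To make the weighted sum approach concrete, the sketch below extends the tabular Q-learning example to one Q-table per objective and selects actions greedily with respect to the synthetic objective TQ(s, a) = Σ_i w_i Q_i(s, a). The environment is assumed to return a reward vector with one component per objective; this interface and the array layout are illustrative assumptions rather than details from the papers cited above.

import numpy as np

def weighted_sum_morl(env, n_states, n_actions, weights,
                      n_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    w = np.asarray(weights, dtype=float)
    n_obj = len(w)
    Q = np.zeros((n_obj, n_states, n_actions))   # one Q-table per objective
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            TQ = w @ Q[:, s, :]                  # synthetic Q-values for state s
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(TQ))
            s_next, r_vec, done = env.step(a)    # r_vec: one reward per objective
            for i in range(n_obj):               # per-objective Q update
                Q[i, s, a] += alpha * (r_vec[i] + gamma * np.max(Q[i, s_next]) - Q[i, s, a])
            s = s_next
    return Q

Repeating such runs with different weight vectors is essentially the naïve multiple-policy strategy of Section III-C and the varying parameter approach reviewed later in this section.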
In [29], the weighted sum approach was employed to combine seven vehicle overtaking objectives, with three navigation modes used to tune the weights. In [30], a similar approach was used for the combination of three objectives, which represent the degree of crowding in an elevator, the waiting time, and the number of start-ends, respectively.

Although the weighted sum approach is very simple to implement, actions in concave regions of the Pareto front may never be chosen, so the Pareto front cannot be well approximated during the learning process [25]. For example, as shown in Fig. 4, if there are two objectives and five candidate actions, actions 2, 3, and 4 lie in the concave region of the Pareto front, while actions 1 and 5 are vertices. For any positive weights {w_i} (1 ≤ i ≤ N), the linearly weighted sums of the Q-values of actions 2, 3, and 4 are never the maximum over all candidate actions. Thus, actions 2, 3, and 4 will never be selected by greedy policies. Instead, one of the two actions 1 and 5 will be frequently chosen according to the preset weights.

Fig. 4. Concave region of the weighted sum approach.

In order to overcome this drawback, some nonlinear functions for weighting Q-values may be used for specific problem domains [31].
B. W-Learning Approach

In order to ensure that the selected action is optimal for at least one objective, several winner-take-all methods for MORL were studied in [32]. When the current state is s, a value W_i(s) is computed for each objective. The selected action is then based on the objective with the maximal W value. One simple method to compute W values is called Top-Q [11], [32], which assigns the W value as the highest Q-value in the current state

W_i(s) = \max_{a} Q_i(s, a), \quad 1 \le i \le N.

The largest W value can be obtained as

W_{\max}(s) = \max_{i} W_i(s) = \max_{i} \left\{ \max_{a} Q_i(s, a) \right\}, \quad 1 \le i \le N   (11)

and therefore the selected action is

\tilde{a} = \arg\max_{a} \left\{ \max_{i} Q_i(s, a) \right\}, \quad 1 \le i \le N.   (12)

The synthetic objective function for the Top-Q approach can be written as

TQ(s, a) = \max_{i} Q_i(s, a), \quad 1 \le i \le N.   (13)

In Top-Q, the selected action is guaranteed to be optimal for at least one objective. However, one drawback of this approach is that the objective with the highest Q-value may have similar priorities for different actions, while other objectives cannot be satisfied due to their low action values. In addition, since the Q-values depend on the scaling of the reward functions, a change in reward scaling may influence the results of the winner-take-all contest in (11)–(13). Therefore, although the Top-Q approach may obtain good performance in some cases, its behavior will be greatly influenced by the design of the reward functions [11].

In order to overcome the above drawback, W-learning was studied in [32] to compute the W values based on the following rule:

W_i(s) \leftarrow (1 - \alpha) W_i(s) + \alpha P_i(s)

P_i(s) = \max_{a} Q_i(s, a) - \left( r_i + \gamma \max_{a'} Q_i(s', a') \right)   (14)

where 1 ≤ i ≤ N and s' is the successive state after action a is executed. At each step, after selecting and executing the action with the highest W value, all the W values except the highest one (the winner or leader) are updated according to the rule in (14).

Humphrys [32] pointed out that it may not be necessary to learn the W values; instead, they can be computed directly from the Q-values in a process called negotiated W-learning, as shown in Algorithm 4. The negotiated W-learning algorithm explicitly finds how much long-term reward an objective is expected to lose if it is not allowed to determine the next action.

Algorithm 4 Negotiated W-Learning [32]
// N: the number of objectives
1: Initialize the leader l with a random integer between 1 and N;
2: Observe state s;
3: W_l = 0;
4: a_l = arg max_a Q_l(s, a);
5: loop:
6: for all objectives i except l do
7:   W_i = max_a Q_i(s, a) − Q_i(s, a_l);
8: end for
9: if max_i W_i > W_l then
10:   W_l = max_i W_i;
11:   a_l = arg max_a Q_i(s, a) (for the maximizing objective i);
12:   l = i;
13:   go to 5;
14: else
15:   terminate the loop;
16: end if
17: Return a_l;

C. AHP Approach [34]

Generally, the designer of an MORL algorithm may not have enough prior knowledge about the optimization problem. In order to express preference information, some qualitative rules are usually employed, such as "objective B is less important than objective A". Such qualitative rules specify the relative importance between two objectives but do not provide a precise mathematical description. Thus, MORL algorithms can make use of the analytic hierarchy process (AHP) to obtain a quantified description of the synthetic objective function TQ(s, a). Compared with the original AHP method in [33], the MORL method proposed in [34] can solve sequential decision-making problems with a variable number of objectives.
Based on the designer's prior knowledge of the problem, the degree of relative importance between two objectives can be quantified in L grades, and a scalar value is defined for each grade. For example, in [34], L is set to 6, and the evaluation of the importance of objective i relative to objective j is denoted by c_{ij}, where 1 ≤ i ≤ N and 1 ≤ j ≤ N. After determining the values of c_{ij}, the relative importance matrix for all objectives, C = (c_{ij})_{N×N}, can be obtained. With matrix C, the importance factor I_i (for objective i) can be calculated as [34]

I_i = \frac{SL_i}{\sum_{j=1}^{N} SL_j}

where

SL_i = \sum_{j=1, j \ne i}^{N} c_{ij}

is the importance of objective i relative to all other objectives. Then, a fuzzy inference system can be constructed for each objective. To compare two candidate actions a_p and a_q (a_p, a_q ∈ A), both the importance factor I_i and the value of improvement D_i(a_p, a_q) = Q_i(s, a_p) − Q_i(s, a_q) are used as inputs to the fuzzy system, and the output of the fuzzy system is the goodness of a_p relative to a_q.
By incorporating fuzzy subsets and fuzzy inference rules, an action selection strategy was constructed to solve the MORL problem [34]. The main drawback of this approach is that it requires a lot of prior knowledge about the problem domain.

D. Ranking Approach

The ranking approach, also called the sequential approach or the threshold approach, aims to solve multiobjective decision problems via an ordering or preference relation among multiple criteria. The idea of using ordinal relations in optimal decision making was studied in the early research of Mitten [35] and Sobel [36], where the synthetic objective function TQ(s, a) was expressed in terms of partial policies. To ensure the effectiveness of the subordinate objective (the less important objective), multiple solutions need to be obtained for the optimization problem of the main objective.

Inspired by the idea of the ranking approach, an ordering of multiple objectives was established in [37] for MORL, where threshold values were specified for some objectives in order to put constraints on those objectives. One example of this kind of situation is an unmanned vehicle that performs navigation tasks in an environment while keeping its fuel level from becoming empty. The MORL approach in [37] optimizes one objective while putting constraints on the other objectives. The actions are chosen based on the thresholds and a lexicographic ordering (the last objective is maximized first) [12], [37]. Let

CQ_i(s, a) = \min \left\{ Q_i(s, a), C_i \right\}

where 1 ≤ i ≤ N and C_i is the threshold value (the maximum allowable value) for objective i. Since objective N is assumed to be unconstrained, C_N = +∞. In the ranking approach for MORL [12], [37], given a partial ordering of all objectives and their threshold values, actions a and a' can be compared by the action comparison mechanism shown in Algorithm 5, which uses a sub-function Superior() that was recursively defined in [12]. Based on this action comparison and selection mechanism, the MORL problem can be solved by combining it with some standard RL algorithm such as Q-learning.

Algorithm 5 Action Comparison Mechanism of the Ranking Approach [12]
Superior(CQ_i(s, a), CQ_i(s, a'), i);
1: if CQ_i(s, a) > CQ_i(s, a') then
2:   Return true;
3: else if CQ_i(s, a) = CQ_i(s, a') then
4:   if i = N then
5:     Return true;
6:   else
7:     Return Superior(CQ_{i+1}(s, a), CQ_{i+1}(s, a'), i + 1);
8:   end if
9: else
10:  Return false;
11: end if

The performance of the ranking-based MORL approach mainly depends on the ordering of the objectives as well as on the threshold values. The design of an appropriate lexicographic ordering of all the objectives and their threshold values still requires some prior knowledge of the problem domain, which reflects the designer's preferences [105]. Geibel [38] employed this idea to balance the expected return and the risk, where the risk is required to be smaller than some specified threshold, and the problem was formulated as a constrained MDP. In [39], the ranking-based MORL approach was applied to the routing problem in cognitive radio networks to address the challenges of randomness, uncertainty, and multiple metrics.

E. Geometric Approach

To deal with dynamic unknown Markovian environments with long-term average reward vectors, Mannor and Shimkin [40] proposed a geometric approach to MORL. It is assumed that the actions of other agents may influence the dynamics of the environment.
Sufficient conditions for state recurrence, i.e., that the game is irreducible or ergodic, are also assumed to be satisfied. In [40], using the proposed geometric idea, two MORL algorithms, called multiple directions RL (MDRL) and single direction RL (SDRL), were presented to approximate a desired target set in a multidimensional objective space, as shown in Fig. 5. This target set can be viewed as the synthetic objective function TQ(s, a), and it satisfies some geometric conditions of the underlying dynamic game model. The MDRL and SDRL algorithms are based on reachability theory for stochastic games, and the main mechanism behind these two algorithms is to steer the average reward vector toward the target set.

Fig. 5. Predicted target set (two objectives).

When a nested class of target sets is prescribed, Mannor and Shimkin [40] also provided an extension of the geometric MORL algorithm, in which the goal of the learning agent is to approach the target set with the smallest size. A particular example of this case is solving constrained MDPs with average rewards. In addition, the geometric algorithms also need some prior knowledge of the problem domain in order to define the target set.

F. Convex Hull Approach

Barrett and Narayanan [42] presented a multiple-policy algorithm for MORL, which can simultaneously learn optimal policies for all linear preference assignments in the objective space. Two operations on convex hulls are defined as follows.

Definition 1 [42] (Translation and scaling operations):

u + bV \equiv \left\{ u + bv \mid v \in V \right\}.   (15)

Definition 2 [42] (Summing two convex hulls):

U + V \equiv hull\left\{ u + v \mid u \in U, v \in V \right\}   (16)

where u and v are vectors in the convex hulls U and V, respectively. Equation (15) indicates that the convex hull of a translated and scaled convex hull is just itself, while (16) indicates that the convex hull of two summed convex hulls requires more computation. Based on these definitions, the convex hull algorithm is illustrated in Algorithm 6, where ˆQ(s, a) contains the vertices of the convex hull of all possible Q-value vectors. This algorithm can be viewed as an extension of standard RL algorithms in which the expected rewards used to update the value function are optimal for some linear preference. Barrett and Narayanan [42] presented this algorithm and proved that it can find the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm when a special weight vector is used. This convex-hull technique can be integrated with other RL algorithms. Since multiple policies are learned at once, the integrated RL algorithms should be off-policy algorithms.

Algorithm 6 Convex Hull Algorithm [42]
1: Initialize ˆQ(s, a), for all s, a, arbitrarily;
2: while not converged do
3:   for all s ∈ S, a ∈ A do
4:     ˆQ(s, a) ← E[ r(s, a) + γ hull ∪_{a'} ˆQ(s', a') | s, a ];
5:   end for
6: end while
7: Return ˆQ;

G. Varying Parameter Approach

Generally, a multiple-policy approach can be realized by performing multiple runs of any single-policy algorithm with different parameters, objective thresholds, and orderings. For example, as indicated in [12] and [27], scalarized Q-learning can be used in a multiple-policy manner by executing repeated runs of the Q-learning algorithm with different parameters. Shelton [43] applied policy gradient methods and the idea of varying parameters to the MORL domain. In the proposed approach, after estimating a policy gradient for each objective, a weighted gradient is obtained, and multiple policies can be found by varying the weights of the objective gradients.

TABLE I. REPRESENTATIVE APPROACHES TO MORL

In summary, the representative MORL approaches mentioned above are briefly described in Table I. In addition to these approaches, some other MORL approaches have been proposed recently [44], [45]. The algorithm proposed in [44] is an extension of multiobjective fitted Q-iteration (FQI) [54] that can find control policies for all the linear combinations of preferences assigned to the objectives in a single training procedure. By performing single-objective FQI in the state-action space as well as in the weight space, the multiobjective FQI (MOFQI) algorithm realizes function approximation and generalization of the action-value function with vectored rewards. In [45], different immediate rewards were given to visited states by comparing the objective vector of the current state with those of the Pareto optimal solutions computed so far. By keeping track of the nondominated solutions and constructing the Pareto front at the end of the optimization process, the Pareto optimal solutions can be memorized in an elite list.
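Multiple-policy methods such as [45] maintain a set of nondominated solutions as an approximation of the Pareto front. A brute-force nondominance filter (for maximization) can be written in a few lines of Python; this is a generic sketch, not the specific elite-list mechanism of [45].

import numpy as np

def pareto_front(points):
    # Keep a point unless some other point is >= in every objective
    # and strictly > in at least one (Pareto dominance, maximization).
    pts = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(pts):
        dominated = any(np.all(q >= p) and np.any(q > p)
                        for j, q in enumerate(pts) if j != i)
        if not dominated:
            front.append(p)
    return np.array(front)

# With two objectives, [2, 2] is dominated by [3, 2], so
# pareto_front([[1, 4], [3, 2], [2, 2]]) keeps only [1, 4] and [3, 2].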
V. IMPORTANT DIRECTIONS OF RECENT RESEARCH ON MORL

Although the MORL approaches summarized in Section IV are very promising for further applications, there are several important directions in which MORL approaches have been improved recently.

A. Further Development of MORL Approaches

In order to obtain suitable representations of the preferences and to improve the efficiency of MORL algorithms, a lot of effort has been made recently. Estimation of distribution algorithms (EDAs) were used by Handa [46] to solve MORL problems. By incorporating notions from evolutionary MOO, the proposed method was able to acquire various strategies in a single run.
Studley and Bull [47] investigated the performance of learning classifier systems for MORL, and the results demonstrated that the choice of action-selection policy can greatly affect the performance of the learning system. Kei et al. [48] selected effective action candidates by the α-domination strategy and used a goal-directed bias based on the achievement level of each evaluation. Zhang et al. [49] and Zhao and Zhang [50] used a parallel genetic algorithm (PGA) to evolve a neuro-controller, and simultaneous perturbation stochastic approximation (SPSA) was used to improve the convergence of the proposed algorithm. The notion of adaptive margins [51] was adopted to improve the performance of MORL algorithms in [52]. To obtain Pareto optimal policies in large or continuous spaces, some recent efforts include the following: Khamis and Gomaa [53] developed a multiagent framework for MORL in traffic signal control, Castelletti et al. [54] made use of FQI for approximating the Pareto front, and Wiering and de Jong [55] proposed a new value iteration algorithm called consistent multiobjective dynamic programming.

Most of the MORL algorithms mentioned above belong to single-policy approaches, and they were studied and tested independently. The main limitation is that the small scale of previous MORL problems may not verify an algorithm's performance in dealing with a wide range of different problem settings, and the algorithm implementations always require much prior knowledge about the problem domain.

B. Dynamic Preferences in MORL

Since many real-world optimization and control problems are not stationary, there has been growing interest in solving dynamic optimization problems in recent years. Similarly, the policies obtained by MORL approaches usually rely on the preferences over the reward vectors, so it is necessary to develop MORL algorithms that take dynamic preferences into consideration. In optimal control with multiple objectives, the designer can use the fixed-weight approach [56] to determine the optimization direction. However, if there is no exact background knowledge of the problem domain, the fixed-weight approach may find unsatisfactory solutions. For MOO, the random-weight approach [57] and the adaptive approach [58] were studied to compute the weights based on the objective data so that the manual selection of objective weights can be simplified. Nevertheless, it is hard for these two approaches to express the preference of the designer.

An important problem is to automatically derive the weights when the designer is unable to assign them. This problem is usually called preference elicitation (PE), and its major related work is inverse RL [59], [60]. In order to predict the future decisions of an agent from its previous decisions, a Bayesian approach was proposed in [61] to learn the utility function. In [62], an apprenticeship learning algorithm was presented in which observed behaviors were used to learn the objective weights of the designer.

Aiming at time-varying preferences among multiple objectives, Natarajan and Tadepalli [63] proposed an MORL method that can find and keep a finite number of policies which can be appropriately selected for varying weight vectors. This algorithm is based on the average reward criterion, and its schema is shown in Algorithm 7, where δ is a tunable parameter.

Algorithm 7 Algorithm Schema for Dynamic MORL [63]
1: Obtain the current weight vector w_new; set δ as a threshold value;
2: π_init = arg max_π ( w_new · ρ^π );
3: Compute the value function vectors of π_init;
4: Compute the average reward vector of π_init;
5: Learn the new policy π through vector-based RL;
6: If ( w_new · ρ^π − w_new · ρ^{π_init} ) > δ, add π to the set of stored policies.
The motivation for this algorithm is that, despite there being infinitely many weight vectors, the set of all optimal policies may be well represented by a small number of optimal policies.

C. Evaluation of MORL Approaches

As a relatively new field of study, research on MORL has mainly focused on various principles and algorithms for dealing with multiple objectives in sequential decision-making problems. Although it is desirable to develop standard benchmark problems and methods for the evaluation of MORL algorithms, there has been little work on this topic except the recent work by Vamplew et al. [12]. In previous studies on MORL, the algorithms were usually evaluated on different learning problems, so it is necessary to define some standard test problems with certain characteristics for rigorous performance testing of MORL algorithms. For example, many MORL algorithms were tested separately on ad hoc problems, including traffic tasks [64]–[67], medical tasks [68], robot tasks [69]–[71], network routing tasks [72], grid tasks [73], and so on. As indicated in [12], it is difficult to make comparisons among these algorithms due to the lack of benchmark test problems and methodologies. Furthermore, for ad hoc application problems, the Pareto fronts are usually unknown, and it is hard to find absolute performance measures for MORL algorithms. Therefore, a suite of benchmark problems with known Pareto fronts was proposed in [12], together with a standard method for performance evaluation, which can serve as a basis for future comparative studies. In addition, two classes of MORL algorithms were evaluated in [12] based on different evaluation metrics. In particular, single-policy MORL algorithms were tested via online learning, while offline learning performance was tested for multiple-policy approaches.

VI. RELATED FIELDS OF STUDY

MORL is a highly interdisciplinary field, and its related fields mainly include MOO, hierarchical RL (HRL), and multiagent RL (MARL). Their relations to MORL and recent progress are discussed in this section. In addition, based on a keyword search in Google Scholar, the average number of MORL-related publications was about 90 per year up to 2001, while this average has become about 700 per year since 2011.
Thus, it can be observed that during the past ten years there has been a significant increase in the number of publications related to MORL.

A. MOO

MOO problems usually have no single optimal solution, which differs from SOO, where a single best solution exists. For MOO and MORL, a set of noninferior, alternative solutions, called the Pareto optimal solutions, can be defined instead of a single optimal solution. Nevertheless, the aim of MOO is to solve a parameter optimization problem with multiple objectives, while MORL is aimed at sequential decision-making problems with multiple objectives. There are two common goals for MOO algorithms: one is to find a set of solutions that is close to the Pareto optimal front, and the other is to obtain a diverse set of solutions representing the whole Pareto optimal front. These two goals have also been studied in MORL. In order to achieve them, a variety of algorithms have been presented for MOO problems, such as multiobjective particle swarm optimization (MOPSO) [74] and multiobjective genetic algorithms (MOGA) [14], [75]. These MOO algorithms have improved mathematical characteristics for solving various MOO problems. Nevertheless, when the dimension of an MOO problem is high, many of these algorithms usually show decreased performance due to the difficulty of finding a wide range of alternative solutions. Moreover, it is hard for MOGA and MOPSO to solve MOO problems with concave Pareto fronts, which are frequently encountered in real-world applications. Most MOO problems have different kinds of constraints and are then called constrained MOO problems. Until now, various constraint handling approaches [76]–[79] have been proposed to improve the performance of multiobjective evolutionary algorithms and MOPSO.

B. HRL

Hierarchical RL (HRL) makes use of a divide-and-conquer strategy to solve complex tasks with large state or decision spaces. Unlike conventional RL, HRL aims to solve sequential decision-making problems that are best described as a set of hierarchically organized tasks and sub-tasks. MORL differs from HRL in that an MORL problem requires the learning agent to solve several tasks with different objectives at once. The most outstanding advantage of HRL is that it can scale to large and complex problems [80]. One common feature of HRL and MORL is that there are multiple tasks that need to be solved by the learning agent. Earlier HRL algorithms need prior knowledge about the high-level structure of complex MDPs. There have been several HRL approaches or formulations to incorporate prior knowledge: HAMs [81], MAXQ [82], options [83], and ALisp [84]. The use of prior knowledge can simplify the problem decomposition and accelerate the learning of good policies. Currently, most HRL algorithms are based on the semi-MDP (SMDP) model. In [85], the SMDP framework was extended to concurrent activities, multiagent domains, and partially observable states. As discussed in [103], although value function approximators can be integrated with HRL, few successful results have been reported in the literature on applying existing HRL approaches to MDPs with large or continuous spaces. Furthermore, automatically decomposing the state space of MDPs or constructing options is still a difficult task. Recently, some new HRL algorithms have been proposed. In [86], an HRL approach was presented in which the state space can be partitioned by critical states. A hierarchical approximate policy iteration (HAPI) algorithm with binary-tree state space decomposition was presented in [87].
In the HAPI approach, after decomposing the original MDP into multiple sub-MDPs with smaller state spaces, better near-optimal local policies can be found, and the final global policy can be derived by combining the local policies of the sub-MDPs. For MORL problems, a hierarchical learning architecture with multiple objectives was proposed in [88]. The major idea of this approach is to make use of a reference network so that an internal reinforcement representation can be generated during the operation of the learning process. Furthermore, internal reinforcement signals from different levels can be provided to represent multilevel objectives for the learning system.

C. MARL

In MORL, the learning agent aims to solve sequential decision problems with reward vectors, and multiple policies may be obtained separately through decomposition. For multiple-policy approaches, the MORL problem can be solved in a distributed manner, which is closely related to multiagent RL (MARL) [89]. In MARL, each agent may also have its own objective, so there may be multiple objectives in the learning system. However, most of the effort in designing MARL systems has been focused on the communication and interaction (cooperation, competition, and mixed strategies) among agents, whereas in MORL there is usually no explicit sharing of information among objectives. The main task of an MARL system is for the autonomous agents to explicitly consider the other agents and coordinate their action policies so that a coherent joint behavior can be realized. In recent years, the research results on MARL can be viewed as a combination of temporal-difference learning, game theory, and direct policy search techniques.

There are several major difficulties in the research on MARL algorithms which do not arise in single-agent settings. The first difficulty is the existence of multiple equilibria. In Markov games with a single equilibrium value, the optimal policy can be well defined. But when there are multiple equilibria, MARL algorithms have to ensure that the agents coordinate their policies to select appropriate equilibria. From an empirical viewpoint, the influence of multiple equilibria on MARL algorithms should be carefully considered, since undesirable equilibria may easily be reached due to certain game properties [90]. In the design of MARL algorithms, one major aim is to make the agents' policies converge to desirable equilibria. Many heuristic exploration strategies have been proposed in the literature so that the probability of reaching the optimal equilibria can be improved in identical-interest games [89]–[92].
VII. CHALLENGES AND OPEN PROBLEMS

In addition to the three important directions for further development discussed in Section V, there are several challenges and open problems in MORL, which are discussed in the sequel.

A. State/Feature Representation in MORL

The problem of selecting an efficient state or feature representation for a real-world problem has been an important topic in RL [4], [8]. In the earlier research on RL, there was important progress in RL theory and algorithms for discrete-state MDPs. However, most real-world problems have large or continuous state spaces, and the huge computational costs make earlier tabular RL algorithms impractical for real applications. In the past decade, approximating value functions or policies with feature representations has been widely studied. One main obstacle, however, is that many RL algorithms and theories with good convergence properties usually rely on manually selected feature representations. Thus, it is difficult to accurately approximate the optimal value functions or policies without carefully selected features. Some recent advances in automatic feature representation include kernel methods and graph Laplacian approaches for RL [93], [94]. In MORL, there are multiple objectives to be achieved, and multiple value functions may need to be approximated simultaneously. The feature representation problem in MORL is more complicated due to the existence of multiple objectives or value functions. Therefore, state or feature representation in MORL is still a big challenge for further research.

B. Value Function Approximation in MORL

As discussed above, to solve MDPs with large or continuous state spaces, value function approximation (VFA) is a key technique to realize generalization and improve learning efficiency [95]. According to the basic properties of the function approximators, there are two different kinds of VFA methods, i.e., linear VFA [96], [97] and nonlinear VFA [9], [98], [99]. In some real-world applications, multilayer neural networks have commonly been employed as nonlinear approximators for VFA. However, the empirical results of successful RL applications using nonlinear VFA commonly lack rigorous theoretical analysis, and some negative results concerning divergence have been reported for Q-learning and TD learning based on direct gradient rules [4], [20]. Hence, many RL approaches with VFA require significant design effort or problem insight, and it is hard to find a basis function set that is both sufficiently simple and sufficiently reliable. To solve MDPs with large or continuous state spaces, MORL algorithms also require VFA to improve generalization ability and reduce computational costs. The additional representation of the preferences among different objectives makes developing VFA techniques even more difficult, especially when the MORL problem has dynamic preferences. Hence, VFA is a greater challenge for MORL than for standard RL.

C. Convergence Analysis of MORL Algorithms

Both the Q-learning algorithm and the Sarsa algorithm have some attractive qualities as basic approaches to RL. The major advantage is that they are guaranteed to converge to the optimal solution for a single MDP with discrete state and action spaces. Suppose there are N objectives in an MORL problem; then this problem can be considered as N sub-MDPs (a sub-MDP is an MDP with one single objective). Hence, the convergence of an algorithm for this MORL problem not only depends on the convergence of the algorithms used to solve these sub-MDPs but also on the representations of the preferences among all the objectives.
The convergence of single-policy approaches to MORL can be analyzed based on the results for the RL algorithms used to solve the sub-MDPs. However, the convergence of multiple-policy approaches also has to consider the way in which the Pareto front is approximated. In short, for stationary preferences, the convergence analysis of MORL algorithms mainly depends on the properties of the learning algorithms used to solve MDPs and on the representations of the preferences. For dynamic preferences, the dynamic characteristics must additionally be considered. So far, the convergence of MORL algorithms commonly lacks rigorous theoretical results.

D. MORL in Multiagent Systems

A multiagent system (MAS) is a system that has multiple interacting autonomous agents, and there are increasing numbers of application domains that are more suitable to be solved by a multiagent system than by a centralized single agent [88], [100]. MORL in multiagent systems is a very important research topic, due to the multiobjective nature of many practical multiagent systems. How to extend the algorithms and theories of MORL from single-agent systems to multiagent systems is an open problem. In particular, achieving several objectives at once through the cooperation of multiple agents is a difficult problem. If there is also competition in the multiagent system, the MORL problem becomes even more challenging.

E. Applications of MORL

In recent years, RL has been applied in a variety of fields [101]–[104], but research on MORL algorithms is a relatively new field of study, and as such there are few real-world applications so far. The following difficult problems should be studied before MORL can be successfully applied in real-time complex systems. One is to develop efficient MORL algorithms that can solve MDPs with large or continuous state spaces. The second is to evaluate the performance of different MORL algorithms in practice and to find the gap between theoretical analysis and real performance. The third is to consider the constraints of real-world applications when developing theoretical models for MORL, and to investigate how theoretical results can promote the development of new algorithms and mechanisms. Since in many real-world applications there are multiple conflicting objectives to be optimized simultaneously, it can be expected that MORL will find more application domains with the development of new theory and algorithms.
VIII. CONCLUSION

In this paper, the background, basic architecture, major research topics, and naïve solutions of MORL were introduced first; then several representative approaches were reviewed, and some important directions of recent research were discussed in detail. There are two main advantages of MORL. One is that MORL is very useful for improving the performance of single-objective RL by generating highly diverse Pareto-optimal models for constructing policy ensembles in domains with multiple objectives. The second is that MORL algorithms can realize a trade-off between accuracy and interpretability for sequential decision-making tasks. Among single-policy approaches, the weighted sum and W-learning approaches are very simple to implement, but they cannot express the preferences of the designer exactly. The AHP, ranking, and geometric approaches can express the preferences more exactly, but they need more prior knowledge of the problem domain. Among multiple-policy approaches, the convex hull algorithm can learn optimal policies for all linear preference assignments over the objective space at once. The varying parameter approach can easily be implemented by performing multiple runs of any single-policy algorithm with different parameters, objective threshold values, and orderings. MORL approaches have been improved recently in three important respects: enhancing their solution quality, adapting to dynamic preferences, and constructing evaluation systems. The main challenges and open problems in MORL include value function approximation, feature representation, convergence analysis of algorithms, and the application of MORL to multiagent systems and real-world problems. It can be expected that there will be more and more research progress in these directions.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the anonymous reviewers for their valuable comments and suggestions, which greatly improved the quality of this paper.

REFERENCES

[1] E. L. Thorndike, Animal Intelligence. Darien, CT, USA: Hafner.
[2] A. M. Turing, "Computing machinery and intelligence," Mind, vol. 59, Oct.
[3] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, no. 1.
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press.
[5] R. S. Sutton, "Learning to predict by the method of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9–44.
[6] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3.
[7] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, May.
[8] X. Xu, D. W. Hu, and X. C. Lu, "Kernel-based least-squares policy iteration for reinforcement learning," IEEE Trans. Neural Netw., vol. 18, no. 4, Jul.
[9] R. H. Crites and A. G. Barto, "Elevator group control using multiple reinforcement learning agents," Mach. Learn., vol. 33, nos. 2–3.
[10] G. J. Tesauro, "Practical issues in temporal difference learning," Mach. Learn., vol. 8, nos. 3–4.
[11] N. Sprague and D. Ballard, "Multiple-goal reinforcement learning with modular Sarsa(0)," in Proc. 18th Int. Joint Conf. Artif. Intell., 2003.
[12] P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker, "Empirical evaluation methods for multiobjective reinforcement learning algorithms," Mach. Learn., vol. 84, nos. 1–2.
[13] G. Mitsuo and C. Runwei, Genetic Algorithms and Engineering Optimization. Beijing, China: Tsinghua Univ. Press.
Press, [14] I. Y. Kim nd O. L. de Weck, Adptive weighted sum method for multiobjective optimiztion: A new method for Preto front genertion, Struct. Multidiscipl. Optim., vol. 31, no. 2, pp , [15] A. Konk, D. W. Coitb, nd A. E. Smith, Multi-objective optimiztion using genetic lgorithms: A tutoril, Relib. Eng. Syst. Sfety, vol. 91, no. 9, pp , Sep [16] M. Yoon, Y. Yun, nd H. Nkym, Sequentil Approximte Multiobjective Optimiztion Using Computtionl Intelligence. Berlin, Germny: Springer, [17] J. G. Lin, On min-norm nd min-mx methods of multi-objective optimiztion, Mth. Progrm., vol. 103, no. 1, pp. 1 33, [18] D. M. Roijers, P. Vmplew, S. Whiteson, nd R. Dzeley, A survey of multi-objective sequentil decision-mking, J. Artif. Intell. Res., vol. 48, no. 1, pp , Oct [19] J. Si, A. Brto, W. Powell, nd D. Wunsch, Hndbook of Lerning nd Approximte Dynmic Progrmming. Hoboken, NJ, USA: Wiley-IEEE Press, [20] T. Jkkol, M. I. Jordn, nd S. P. Singh, On the convergence of stochstic itertive dynmic progrmming lgorithms, Neurl Comput., vol. 6, no. 6, pp , Nov [21] S. P. Singh, T. Jkkol, M. L. Littmn, nd C. Szepesvri, Convergence results for single-step on-policy reinforcement lerning lgorithms, Mch. Lern., vol. 38, no. 3, pp , Mr [22] A. Schwrtz, A reinforcement lerning method for mximizing undiscounted rewrds, in Proc. 10th Int. Conf. Mch. Lern., 1993, pp [23] H. L. Lio, Q. H. Wu, nd L. Jing, Multi-objective optimiztion by reinforcement lerning for power system disptch nd voltge stbility, in Proc. Innov. Smrt Grid Technol. Conf. Eur., Gothenburg, Sweden, 2010, pp [24] K. Sindhy, Hybrid evolutionry multi-objective optimiztion with enhnced convergence nd diversity, Ph.D. disserttion, Dept. Mth. Inf. Tech., Univ. Jyvskyl, Jyvskyl, Finlnd, [25] P. Vmplew, J. Yerwood, R. Dzeley, nd A. Berry, On the limittions of sclristion for multi-objective reinforcement lerning of Preto fronts, in Proc. 21st Aust. Joint Conf. Artif. Intell., vol , pp [26] E. Zitzler, L. Thiele, M. Lumnns, C. M. Fonsec, nd V. G. d Fonsec, Performnce ssessment of multiobjective optimizers: An nlysis nd review, IEEE Trns. Evol. Comput.,vol.7, no. 2, pp , Apr [27] A. Cstelletti, G. Corni, A. Rizzolli, R. Soncinie-Sess, nd E. Weber, Reinforcement lerning in the opertionl mngement of wter system, in Proc. IFAC Workshop Model. Control Environ. Issues, Yokohm, Jpn, 2002, pp [28] J. Krlsson, Lerning to solve multiple gols, Ph.D. disserttion, Dept. Comput. Sci., Univ. Rochester, Rochester, NY, USA, [29] D. C. K. Ngi nd N. H. C. Yung, A multiple gol reinforcement lerning method for complex vehicle overtking mneuvers, IEEE Trns. Intell. Trnsp. Syst., vol. 12, no. 2, pp , Jun [30] F. Zeng, Q. Zong, Z. Sun, nd L. Dou, Self-dptive multi-objective optimiztion method design bsed on gent reinforcement lerning for elevtor group control systems, in Proc. 8th World Congr. Int. Control Autom., Jinn, Chin, 2010, pp [31] G. Tesuro et l., Mnging power consumption nd performnce of computing systems using reinforcement lerning, in Advnces in Neurl Informtion Processing Systems. Cmbridge, MA, USA: MIT Press, 2007, pp [32] M. Humphrys, Action selection methods using reinforcement lerning, in From Animls to Animts 4, P. Mes, M. Mtric, J.-A. Meyer, J. Pollck, nd S. W. Wilson, Eds. Cmbridge, MA, USA: MIT Press, 1996, pp

[33] X. N. Shen, Y. Guo, Q. W. Chen, and W. L. Hu, "A multi-objective optimization genetic algorithm incorporating preference information," Inf. Control, vol. 36, no. 6.
[34] Y. Zhao, Q. W. Chen, and W. L. Hu, "Multi-objective reinforcement learning algorithm for MOSDMP in unknown environment," in Proc. 8th World Congr. Int. Control Autom., 2010.
[35] L. G. Mitten, "Composition principles for synthesis of optimum multistage processes," Oper. Res., vol. 12, Aug.
[36] M. J. Sobel, "Ordinal dynamic programming," Manage. Sci., vol. 21, May.
[37] Z. Gabor, Z. Kalmar, and C. Szepesvari, "Multi-criteria reinforcement learning," in Proc. 15th Int. Conf. Mach. Learn., 1998.
[38] P. Geibel, "Reinforcement learning with bounded risk," in Proc. 18th Int. Conf. Mach. Learn., 2001.
[39] K. Zheng, H. Li, R. C. Qiu, and S. Gong, "Multi-objective reinforcement learning based routing in cognitive radio networks: Walking in a random maze," in Proc. Int. Conf. Comput. Netw. Commun., 2012.
[40] S. Mannor and N. Shimkin, "A geometric approach to multi-criterion reinforcement learning," J. Mach. Learn. Res., vol. 5, Jan.
[41] S. Mannor and N. Shimkin, "The steering approach for multi-criteria reinforcement learning," in Proc. Adv. Neural Inf. Process. Syst., 2001.
[42] L. Barrett and S. Narayanan, "Learning all optimal policies with multiple criteria," in Proc. 25th Int. Conf. Mach. Learn., 2008.
[43] C. R. Shelton, "Balancing multiple sources of reward in reinforcement learning," in Proc. Adv. Neural Inf. Process. Syst., 2000.
[44] A. Castelletti, F. Pianosi, and M. Restelli, "Tree-based fitted Q-iteration for multi-objective Markov decision problems," in Proc. Int. Joint Conf. Neural Netw., 2012.
[45] H. L. Liu and Q. H. Wu, "Multi-objective optimization by reinforcement learning," in Proc. IEEE Congr. Evol. Comput., 2010.
[46] H. Handa, "Solving multi-objective reinforcement learning problems by EDA-RL: Acquisition of various strategies," in Proc. 9th Int. Conf. Int. Syst. Design Appl., 2009.
[47] M. Studley and L. Bull, "Using the XCS classifier system for multiobjective reinforcement learning," Artif. Life, vol. 13, no. 1.
[48] A. Kei, S. Jun, A. Takanobu, I. Kokoro, and K. Shigenobu, "Multi-criteria reinforcement learning based on goal-directed exploration and its application to a bipedal walking robot," Trans. Inst. Syst. Control Inf. Eng., vol. 18, no. 10.
[49] H. J. Zhang, J. Zhao, R. Wang, and T. Ma, "Multi-objective reinforcement learning algorithm and its application in drive system," in Proc. 34th Annu. IEEE Conf. Ind. Electron., Orlando, FL, USA, 2008.
[50] J. Zhao and H. J. Zhang, "Multi-objective reinforcement learning algorithm and its improved convergency method," in Proc. 6th IEEE Conf. Ind. Electron. Appl., Beijing, China, 2011.
[51] K. Hiraoka, M. Yoshida, and T. Mishima, "Parallel reinforcement learning for weighted multi-criteria model with adaptive margin," Cogn. Neurodyn., vol. 3, Mar.
[52] X. L. Chen, X. C. Hao, H. W. Lin, and T. Murata, "Rule driven multi objective dynamic scheduling by data envelopment analysis and reinforcement learning," in Proc. IEEE Int. Conf. Autom. Logist., Hong Kong, 2010.
[53] M. A. Khamis and W. Gomaa, "Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework," Eng. Appl. Artif. Intell., vol. 29, Mar.
[54] A. Castelletti, F. Pianosi, and M. Restelli, "Multi-objective fitted Q-iteration: Pareto frontier approximation in one single run," in Proc. IEEE Int. Conf. Netw. Sens. Control, Delft, The Netherlands, 2011.
[55] M. A. Wiering and E. D. de Jong, "Computing optimal stationary policies for multi-objective Markov decision processes," in Proc. IEEE Int. Symp. Approx. Dyn. Program. Reinf. Learn., Honolulu, HI, USA, 2007.
[56] M. Oubbati, P. Levi, and M. Schanz, "A fixed-weight RNN dynamic controller for multiple mobile robots," in Proc. 24th IASTED Int. Conf. Model. Identif. Control, 2005.
[57] C. Q. Zhang, J. J. Zhang, and X. H. Gu, "The application of hybrid genetic particle swarm optimization algorithm in the distribution network reconfigurations multi-objective optimization," in Proc. 3rd Int. Conf. Nat. Comput.
[58] D. Zheng, M. Gen, and R. Cheng, "Multiobjective optimization using genetic algorithms," Eng. Val. Cost Anal., vol. 2.
[59] A. Y. Ng and S. Russell, "Algorithms for inverse reinforcement learning," in Proc. 17th Int. Conf. Mach. Learn., 2000.
[60] C. Boutilier, "A POMDP formulation of preference elicitation problems," in Proc. 18th Nat. Conf. Artif. Intell., 2002.
[61] U. Chajewska, D. Koller, and D. Ormoneit, "Learning an agent's utility function by observing behavior," in Proc. 18th Int. Conf. Mach. Learn., 2001.
[62] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proc. 21st Int. Conf. Mach. Learn., 2004.
[63] S. Natarajan and P. Tadepalli, "Dynamic preferences in multi-criteria reinforcement learning," in Proc. 22nd Int. Conf. Mach. Learn., 2005.
[64] M. A. Khamis and W. Gomaa, "Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework," Eng. Appl. Artif. Intell., vol. 29.
[65] Z. H. Yang and K. G. Wen, "Multi-objective optimization of freeway traffic flow via a fuzzy reinforcement learning method," in Proc. 3rd Int. Conf. Adv. Comput. Theory Eng., vol. 5, 2010.
[66] K. G. Wen, W. G. Yang, and S. R. Qu, "Efficiency and equity based freeway traffic network flow control," in Proc. 2nd Int. Conf. Comput. Autom. Eng.
[67] H. L. Duan, Z. H. Li, and Y. Zhang, "Multi-objective reinforcement learning for traffic signal control using vehicular ad hoc network," EURASIP J. Adv. Signal Process., vol. 2010, pp. 1-7, Mar. 2010.
[68] R. S. H. Istepanian, N. Y. Philip, and M. G. Martini, "Medical QoS provision based on reinforcement learning in ultrasound streaming over 3.5G wireless systems," IEEE J. Select. Areas Commun., vol. 27, no. 4, May.
[69] Y. Nojima, F. Kojima, and N. Kubota, "Local episode-based learning of multi-objective behavior coordination for a mobile robot in dynamic environments," in Proc. 12th IEEE Int. Conf. Fuzzy Syst.
[70] D. C. K. Ngai and N. H. C. Yung, "Automated vehicle overtaking based on a multiple-goal reinforcement learning framework," in Proc. IEEE Int. Conf. Control Appl., Seattle, WA, USA, 2010.
[71] T. Miyazaki, "An evaluation pattern generation scheme for electric components in hybrid electric vehicles," in Proc. 5th IEEE Int. Conf. Int. Syst., Yokohama, Japan, 2010.
[72] A. Petrowski, F. Aissanou, I. Benyahia, and S. Houcke, "Multicriteria reinforcement learning based on a Russian doll method for network routing," in Proc. 5th IEEE Int. Conf. Intell. Syst., 2010.
[73] J. Perez, C. Germain-Renaud, B. Kegl, and C. Loomis, "Multi-objective reinforcement learning for responsive grids," J. Grid Comput., vol. 8, no. 3.
[74] L. V. S. Quintero, N. R. Santiago, and C. A. C. Coello, "Towards a more efficient multi-objective particle swarm optimizer," in Multi-Objective Optimization in Computational Intelligence: Theory and Practice. Hershey, PA, USA: IGI Global, 2008.
[75] H. Li and Q. F. Zhang, "MOEA/D: A multiobjective evolutionary algorithm based on decomposition," IEEE Trans. Evol. Comput., vol. 11, no. 6, Dec.
[76] G. Tsaggouris and C. Zaroliagis, "Multiobjective optimization: Improved FPTAS for shortest paths and non-linear objectives with applications," Theory Comput. Syst., vol. 45, no. 1.
[77] J. P. Dubus, C. Gonzales, and P. Perny, "Multiobjective optimization using GAI models," in Proc. Int. Conf. Artif. Intell., 2009.
[78] P. Perny and O. Spanjaard, "Near admissible algorithms for multiobjective search," in Proc. Eur. Conf. Artif. Intell., 2008.
[79] G. G. Yen and W. F. Leong, "A multiobjective particle swarm optimizer for constrained optimization," Int. J. Swarm Intell. Res., vol. 2, no. 1, pp. 1-23.
[80] N. Dethlefs and H. Cuayahuitl, "Hierarchical reinforcement learning for adaptive text generation," in Proc. 6th Int. Conf. Nat. Lang. Gener., 2010.
[81] R. Parr and S. Russell, "Reinforcement learning with hierarchies of machines," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 1997.

[82] T. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," J. Artif. Intell. Res., vol. 13, no. 1, Aug.
[83] D. Precup and R. Sutton, "Multi-time models for temporally abstract planning," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 1998.
[84] D. Andre and S. Russell, "State abstraction for programmable reinforcement learning agents," in Proc. 18th Nat. Conf. Artif. Intell., 2002.
[85] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dyn. Syst. Theory Appl., vol. 13, nos. 1-2.
[86] Z. Jin, W. Y. Liu, and J. Jin, "Partitioning the state space by critical states," in Proc. 4th Int. Conf. Bio-Inspired Comput., 2009.
[87] X. Xu, C. Liu, S. Yang, and D. Hu, "Hierarchical approximate policy iteration with binary-tree state space decomposition," IEEE Trans. Neural Netw., vol. 22, no. 12, Dec.
[88] H. B. He and B. Liu, "A hierarchical learning architecture with multiple-goal representations based on adaptive dynamic programming," in Proc. Int. Conf. Netw. Sens. Control, 2010.
[89] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multi-agent reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 2, Mar.
[90] C. Claus and C. Boutilier, "The dynamics of reinforcement learning in cooperative multiagent systems," in Proc. 15th Nat. Conf. Artif. Intell., 1998.
[91] S. Devlin and D. Kudenko, "Theoretical considerations of potential-based reward shaping for multi-agent systems," in Proc. 10th Annu. Int. Conf. Auton. Agents Multiagent Syst., 2011.
[92] F. Leon, "Evolving equilibrium policies for a multiagent reinforcement learning problem with state attractors," in Proc. Int. Conf. Comput. Collect. Intell., Gdynia, Poland, 2011.
[93] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, May.
[94] S. Mahadevan and M. Maggioni, "Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes," J. Mach. Learn. Res., vol. 8, Jan.
[95] D. Liu, Y. Zhang, and H. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, Sep.
[96] X. Xu, H. G. He, and D. W. Hu, "Efficient reinforcement learning using recursive least-squares methods," J. Artif. Intell. Res., vol. 16, Apr.
[97] J. Boyan, "Technical update: Least-squares temporal difference learning," Mach. Learn., vol. 49, nos. 2-3.
[98] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, Mar.
[99] W. Zhang and T. Dietterich, "A reinforcement learning approach to job-shop scheduling," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995.
[100] J. Wu et al., "A novel multi-agent reinforcement learning approach for job scheduling in grid computing," Future Gen. Comput. Syst., vol. 27, no. 5.
[101] H. S. Ahn et al., "An optimal satellite antenna profile using reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 41, no. 3, May.
[102] F. Bernardo, R. Agustí, J. Pérez-Romero, and O. Sallent, "Intercell interference management in OFDMA networks: A decentralized approach based on reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 41, no. 6, Nov.
[103] S. Adam, L. Busoniu, and R. Babuska, "Experience replay for real-time reinforcement learning control," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 2, Mar.
[104] F. Hernandez-del-Olmo, E. Gaudioso, and A. Nevado, "Autonomous adaptive and active tuning up of the dissolved oxygen setpoint in a wastewater treatment plant using reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 5, Sep.
[105] R. Issabekov and P. Vamplew, "An empirical comparison of two common multiobjective reinforcement learning algorithms," in Proc. 25th Int. Australas. Joint Conf., Sydney, NSW, Australia, 2012.

Chunming Liu received the B.Sc. and M.Sc. degrees from the National University of Defense Technology, Changsha, China, in 2004 and 2006, respectively, where he is currently pursuing the Ph.D. degree. His current research interests include intelligent systems, machine learning, and autonomous land vehicles.

Xin Xu (M'07-SM'12) received the B.S. degree in electrical engineering from the National University of Defense Technology (NUDT), Changsha, China, in 1996, and the Ph.D. degree in control engineering from the College of Mechatronics and Automation, NUDT. He is currently a Full Professor with the Institute of Unmanned Systems, College of Mechatronics and Automation, NUDT. He has been a Visiting Scientist for cooperation research with Hong Kong Polytechnic University, Hong Kong; the University of Alberta, Edmonton, AB, Canada; the University of Guelph, Guelph, ON, Canada; and the University of Strathclyde, Glasgow, U.K. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles. He has authored or coauthored over 100 papers in international journals and conferences and has coauthored four books. He currently serves as an Associate Editor of Information Sciences and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. Dr. Xu was the recipient of the 2nd Class National Natural Science Award of China in 2012 and the Fok Ying Tong Youth Teacher Fund of China. He is a Committee Member of the IEEE Technical Committee on Approximate Dynamic Programming and Reinforcement Learning and the IEEE Technical Committee on Robot Learning. He was a PC Member or Session Chair of various international conferences.

Dewen Hu (SM'09) was born in Hunan, China. He received the B.Sc. and M.Sc. degrees from Xi'an Jiaotong University, Xi'an, China, in 1983 and 1986, respectively, and the Ph.D. degree from the National University of Defense Technology, Changsha, China. From 1986, he was with the National University of Defense Technology. From 1995 to 1996, he was a Visiting Scholar with the University of Sheffield, Sheffield, U.K., and he was later promoted to Professor. His current research interests include image processing, system identification and control, neural networks, and cognitive science. Dr. Hu was the recipient of the 2nd Class National Natural Science Award of China. He is an Action Editor of Neural Networks.
