Multiobjective Reinforcement Learning: A Comprehensive Overview

Chunming Liu, Xin Xu, Senior Member, IEEE, and Dewen Hu, Senior Member, IEEE

Abstract: Reinforcement learning (RL) is a powerful paradigm for sequential decision-making under uncertainties, and most RL algorithms aim to maximize some numerical value which represents only one long-term objective. However, multiple long-term objectives are exhibited in many real-world decision and control systems, so recently there has been growing interest in solving multiobjective reinforcement learning (MORL) problems where there are multiple conflicting objectives. The aim of this paper is to present a comprehensive overview of MORL. The basic architecture, research topics, and naïve solutions of MORL are introduced at first. Then, several representative MORL approaches and some important directions of recent research are comprehensively reviewed. The relationships between MORL and other related research are also discussed, including multiobjective optimization, hierarchical RL, and multiagent RL. Moreover, research challenges and open problems of MORL techniques are suggested.

Index Terms: Markov decision process (MDP), multiobjective reinforcement learning (MORL), Pareto front, reinforcement learning (RL), sequential decision-making.

Manuscript received February 16, 2014; revised June 11, 2014; accepted August 15, 2014. Date of publication October 8, 2014; date of current version February 12, 2015. This work was supported in part by the Program for New Century Excellent Talents in Universities under Grant NCET and in part by the National Fundamental Research Program of China under Grant 2013CB. This paper was recommended by Associate Editor A. H. Tan. C. Liu is with the College of Mechatronics and Automation, National University of Defense Technology, Changsha, China. X. Xu is with the Institute of Unmanned Systems, College of Mechatronics and Automation, National University of Defense Technology, Changsha, China (e-mail: xinxu@nudt.edu.cn). D. Hu is with the Department of Automatic Control, College of Mechatronics and Automation, National University of Defense Technology, Changsha, China (e-mail: dwhu@nudt.edu.cn).

I. INTRODUCTION

Reinforcement learning (RL) was originally studied from the perspective of animal learning behaviors [1], and it has become a major class of machine learning methods [2] for solving sequential decision-making problems under uncertainties [3], [4]. In an RL system, a learning agent aims to learn an optimal action policy via interactions with an uncertain environment. At each step, the learning agent is not told explicitly what action to take; instead, it must determine the best action to maximize long-term rewards and execute it. The selected action then causes the current state of the environment to transition to its successive state, and the agent receives a scalar reward signal that evaluates the effect of this state transition, as shown in Fig. 1. Thus, there is a feedback architecture in a learning system based on RL, and the interaction between the learning agent and its environment can be described by a sequence of states, actions, and rewards. This sequential decision process is usually modeled as a Markov decision process (MDP). The rule or strategy for action selection is called a policy. In RL, the agent learns optimal or near-optimal action policies from such interactions in order to maximize some notion of long-term objectives. In the past decades, there has been a large number of works on RL theory and algorithms [5]–[8].

Fig. 1. Basic RL scenario.
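As a minimal illustration of the interaction loop in Fig. 1, the following Python sketch runs one episode of agent-environment interaction. The env and policy objects and their reset/step interfaces are hypothetical placeholders introduced only for this example; they are not part of the original paper.

def run_episode(env, policy, max_steps=1000):
    # One pass through the basic RL loop of Fig. 1 (hypothetical env/policy interfaces).
    state = env.reset()                      # observe the initial state
    trajectory = []                          # sequence of (state, action, reward)
    for _ in range(max_steps):
        action = policy(state)               # agent selects an action
        next_state, reward, done = env.step(action)   # state transition + scalar reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory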
By focusing the computational effort along state transition trajectories and using function approximation techniques for estimating value functions or policies, RL algorithms have produced good results in some challenging real-world problems [9], [10]. However, despite many advances in RL theory and algorithms, one remaining challenge is to scale up to larger and more complex problems. The scaling problem for sequential decision-making mainly includes the following aspects [11]: a problem that has a very large or continuous state or action space; a problem that is best described as a set of hierarchically organized tasks and sub-tasks; and a problem that needs to solve several tasks with different rewards simultaneously. An RL problem of the last kind is called a multiobjective RL (MORL) problem, which refers to a sequential decision-making problem with multiple objectives.

MORL has been regarded as an important research topic due to the multiobjective characteristics of many practical sequential decision-making and adaptive optimal control problems in the real world. Compared with conventional RL problems, MORL problems require a learning agent to obtain action policies that can optimize two or more objectives at the same time. In MORL, each objective has its own associated reward signal, so the reward is not a scalar value but a vector. When all the objectives are directly related, a single objective can be derived by combining the multiple objectives together. If all the objectives are completely unrelated, they can be optimized separately and we can find a combined policy to optimize all of them. However, if there are conflicting objectives, any policy can only maximize one of the objectives, or realize a trade-off among the conflicting objectives [12]. Therefore, MORL can be viewed as the combination of multiobjective optimization (MOO) and RL techniques to solve sequential decision-making problems with multiple conflicting objectives.
In the MOO domain, there are two common strategies [13]: one is the multiobjective-to-single-objective strategy and the other is the Pareto strategy. The former strategy optimizes a scalar value, and includes the weighted sum method [14], the constraint method [15], the sequential method [16], and the max-min method [17], among others. In these methods, a scalar value is computed from the multiple objectives as the utility of an action decision, and conventional single-objective optimization (SOO) techniques can be used. The latter strategy uses vector-valued utilities. In this case, it is difficult to order the candidate solutions completely, and the Pareto optimality concept is usually used. The Pareto optimal solutions are defined as noninferior, alternative solutions among the candidate solutions, and they represent the optimal solutions for some possible trade-offs among the multiple conflicting objectives [12]. All Pareto optimal solutions constitute the Pareto front, and one major research issue of MOO is to find or approximate the Pareto front.

Similar to the MOO domain, MORL algorithms can be divided into two classes based on the number of learned policies. One class comprises single-policy MORL approaches and the other, multiple-policy MORL approaches [12], [18]. Single-policy approaches aim to find the best single policy which represents the preferences among the multiple objectives as specified by a user or derived from the problem domain. The major difference among single-policy approaches is the way of determining and expressing these preferences. The aim of multiple-policy MORL approaches is to find a set of policies that approximate the Pareto front. The main difference among multiple-policy approaches is the approximation scheme for the Pareto front. Major approaches to MORL are further discussed in Section IV.

Although there have been some recent advances in different MORL algorithms, many research challenges still remain in developing MORL theory and algorithms for real-world problems. In addition, to the authors' knowledge, only one related survey has been published in the literature, and it covers the much broader topic of multiobjective sequential decision-making [18]. Therefore, the aim of this paper is to provide a comprehensive review of MORL principles, algorithms, and some open problems. A representative set of MORL approaches are selected to show the overall framework of the field, to present a summary of major achievements, and to suggest some open problems for future research.

The remainder of this paper is organized as follows. In Section II, the background of MORL is briefly introduced, including MDPs, RL, and MOO. The basic architecture, research topics, and naïve solutions of MORL are described in Section III. A representative set of approaches to MORL are reviewed in Section IV. In Section V, some important directions of recent research on MORL are discussed in detail. Related fields of study are introduced in Section VI. Section VII analyzes the challenges and open problems of MORL. Section VIII concludes this comprehensive overview.

II. BACKGROUND

In this section, the necessary background on MDP models, RL techniques, and MOO problems is introduced. First, the MDP is characterized as the formulation of a sequential decision-making problem.
Then, some basic RL techniques are introduced, where the discussion is restricted to finite state and action spaces, since most MORL results up to now are given for finite spaces. Finally, the MOO problem is introduced, as well as the concept of Pareto optimality.

A. MDP Models

A sequential decision-making problem can be formulated as an MDP, which is defined as a 4-tuple {S, A, R, P}. In this 4-tuple, S is the state space of a finite set of states, A is the action space of a finite set of actions, R is the reward function, and P is the matrix of state transition probabilities. For a state transition from state s to state s' when taking action a, p(s, a, s') and r(s, a, s') represent the probability and the reward of the state transition, respectively. An action policy of the MDP is defined as a function π : S → Pr(A), where Pr(A) is a probability distribution over A.

Due to the different influences of future rewards on the present value, there are two different objective functions of an MDP. One is the discounted reward criterion, which is to estimate the optimal policy π* satisfying

J^{\pi^*} = \max_{\pi} J^{\pi} = \max_{\pi} E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]   (1)

where γ (0 < γ < 1) is the discount factor, r_t = r(x_t, a_t) is the reward at time step t, E_π[·] stands for the expectation with respect to the policy π and the probability matrix P, and J^π is the expected total reward. The other is called the average reward criterion, which is to estimate the optimal policy π* satisfying

\rho^{\pi^*} = \max_{\pi} \rho^{\pi} = \max_{\pi} \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} E_{\pi}[r_{t}]   (2)

where ρ^π is the average reward per time step for the policy π.

For the discounted reward criterion, the state value function and the state-action value function for a policy π are defined by

V^{\pi}(s) = E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s \right]   (3)

Q^{\pi}(s, a) = E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s, a_0 = a \right].   (4)

According to the theory of dynamic programming (DP) [20], the following Bellman equations are satisfied:

V^{\pi}(s) = E_{\pi}\left[ r(s, a) + \gamma V^{\pi}(s') \right]   (5)

Q^{\pi}(s, a) = R(s, a) + \gamma E_{\pi}\left[ Q^{\pi}(s', a') \right]   (6)

where R(s, a) is the expected reward received after taking action a in state s, s' is the successive state of s, and π(s, a) represents the probability of action a being taken by policy π in state s.
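The Bellman equation (5) can be turned directly into an iterative policy evaluation procedure. The following Python sketch assumes a tabular MDP stored as NumPy arrays (P for transition probabilities, R for expected rewards, pi for a stochastic policy); this array layout is an illustrative assumption, not notation from the paper.

import numpy as np

def evaluate_policy(P, R, pi, gamma=0.95, tol=1e-8):
    # P[s, a, s']: transition probabilities, R[s, a]: expected rewards,
    # pi[s, a]: action probabilities of the policy to be evaluated.
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)              # backup of every state-action pair
        V_new = (pi * Q).sum(axis=1)         # V(s) = sum_a pi(s,a) Q(s,a), i.e., Eq. (5)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new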

The optimal state-action value function is defined as

Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a).

When Q*(s, a) is obtained, the optimal policy π* can be computed by

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)

where the optimal policy π* is a deterministic policy, and it is a mapping from S to A.

For the average reward criterion, let ρ^π [see (2)] be the average reward per time step for policy π. The relative state value function and the relative state-action value function are defined as [4]

V^{\pi}(s) = \sum_{t=0}^{\infty} E_{\pi}\left[ r_t - \rho^{\pi} \mid s_0 = s \right]

Q^{\pi}(s, a) = \sum_{t=0}^{\infty} E_{\pi}\left[ r_t - \rho^{\pi} \mid s_0 = s, a_0 = a \right]

and the following Bellman equations are satisfied:

V^{\pi}(s) = E_{\pi}\left[ r_t + V^{\pi}(s') \right] - \rho^{\pi}

Q^{\pi}(s, a) = R(s, a) - \rho^{\pi} + E_{\pi}\left[ Q^{\pi}(s', a') \right].   (7)

The optimal relative state-action value function for the average reward setting satisfies

Q^{*}(s, a) + \rho^{\pi^{*}} = \max_{\pi} \left\{ Q^{\pi}(s, a) + \rho^{\pi} \right\}

and the optimal policy π* can also be obtained by

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).

B. Basic RL Algorithms

The earlier approach to solving MDPs with model information is to use dynamic programming (DP) techniques, which compute the optimal policies by estimating the optimal state-action value functions. However, traditional DP algorithms commonly require full model information, and large amounts of computation are needed for large state and action spaces. Different from DP, an RL agent learns an optimal or near-optimal policy by interacting with the environment, whose dynamic model is assumed to be unknown [4], [19]. As indicated in [4], based on the observed state transition data of MDPs, RL algorithms integrate the techniques of Monte Carlo estimation, stochastic approximation, and function approximation to obtain approximate solutions of MDPs.

As a central mechanism of RL, temporal-difference (TD) learning [5] can be viewed as a combination of Monte Carlo and DP. On one hand, like Monte Carlo methods, TD algorithms can learn the value functions from state transition data without model information. On the other hand, similar to DP, TD methods can update the current estimate of the value functions partially based on previously learned results, before a final outcome is obtained [4], [5].

For the discounted reward criterion, Q-learning and Sarsa are the most widely used tabular RL algorithms. The Q-learning algorithm is shown in Algorithm 1, where α is the learning rate parameter and r is the immediate reward.

Algorithm 1 Q-Learning Algorithm [4], [6]
// N: the maximum number of episodes
1: Initialize Q(s, a) arbitrarily;
2: repeat (for each episode i)
3:   Initialize s;
4:   repeat (for each step of the episode)
5:     Choose a from s using a policy derived from Q(s, a);
6:     Take action a, observe r, s';
7:     Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)];
8:     s ← s';
9:   until s is terminal
10: until i = N

Algorithm 2 R-Learning Algorithm [4], [22]
// ρ: the average reward
// N: the maximum number of episodes
1: Initialize Q(s, a) and ρ arbitrarily;
2: repeat (for each episode i)
3:   s ← current state;
4:   Select a from s using a policy derived from Q(s, a);
5:   Take action a, observe r, s';
6:   Q(s, a) ← Q(s, a) + α[r − ρ + max_{a'} Q(s', a') − Q(s, a)];
7:   if Q(s, a) = max_{a} Q(s, a) then
8:     ρ ← ρ + β[r − ρ + max_{a'} Q(s', a') − max_{a} Q(s, a)];
9:   end if
10: until i = N

If, in the limit, the Q values of all admissible state-action pairs are updated infinitely often, and α decays in a way satisfying the usual stochastic approximation conditions, then the Q values will converge to the optimal value Q* with probability 1 [20]. For the Sarsa algorithm, if each action is executed infinitely often in every state that is visited infinitely often, the action is greedy with respect to the current Q value in the limit, and the learning rate decays appropriately, then the estimated Q values will also converge to the optimal value Q* with probability 1 [21].
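For concreteness, a minimal Python version of the tabular Q-learning update in Algorithm 1 is sketched below. The epsilon-greedy exploration scheme and the env.reset()/env.step() interface are assumptions added for the example; they are not part of the original pseudocode.

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy policy derived from Q (line 5 of Algorithm 1)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD update toward r + gamma * max_a' Q(s', a') (line 7 of Algorithm 1)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q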
For the average reward criterion, R-learning [22] is the most widely studied TD-based RL algorithm. The major steps of the R-learning algorithm are illustrated in Algorithm 2.

C. MOO Problems

The MOO problem can be formulated as follows [23], [24]:

\max F(X) = \left[ f_1(X), f_2(X), \ldots, f_{m_f}(X) \right]
\text{s.t. } g_i(X) \ge 0, \quad i = 1, \ldots, m_g

where the max operator for a vector is defined either in the sense of Pareto optimality or in the sense of maximizing a weighted scalar of all the elements, X = [x_1, x_2, \ldots, x_N]^T \in R^N is the vector of variables to be optimized, the functions g_i(X) (i = 1, 2, \ldots, m_g) are the constraint functions of the problem, and f_i(X) (i = 1, 2, \ldots, m_f) are the objective functions.

The optimal solutions of an MOO problem can be described by two concepts. One is the concept of multiobjective to single objective, in which a synthetic objective function is derived, and the optimal solution of the MOO problem can be obtained by solving an SOO problem.
The other is the concept of Pareto dominance and the Pareto front [25]. When a solution A is better than another solution C for at least one objective, and A is also superior or at least equal to C for all the other objectives, the solution A is said to dominate C. In MOO, it is preferable to find all the dominating solutions instead of the dominated ones [25]. In the case of m_f = 2, as shown in Fig. 2(a), solution C is dominated by A and B, and it is hard to compare the overall performance of A and B. As indicated in [25], the Pareto front can be generated by deleting all the dominated solutions from the set of all possible solutions. From Fig. 2(b), it can be seen that the Pareto front is the set of all the black points, while the solutions corresponding to the gray points are dominated by at least one element of the Pareto front. Since it is difficult to obtain the complete Pareto front for an arbitrary real-world MOO problem, a simplified goal for MOO is to find a set of solutions that approximates the real Pareto front [26].

Fig. 2. Concepts of Pareto dominance and Pareto front [25]. (a) Pareto dominance. (b) Pareto front.

III. MORL PROBLEM

Before providing insights into the current state of the art and determining some important directions for future research, it is necessary to characterize the basic architecture, main research topics, and naïve solutions of MORL in advance.

A. Basic Architecture

MORL differs from traditional RL in that there are two or more objectives to be optimized simultaneously by the learning agent, and a reward vector is provided to the learning agent at each step. Fig. 3 shows the basic architecture of MORL, where there are N objectives and r_i (1 ≤ i ≤ N) is the ith feedback signal of the agent's current reward vector provided by the environment. Obviously, this basic architecture illustrates the case of a single agent that has to optimize its action policies for a set of different objectives simultaneously.

Fig. 3. Basic architecture of MORL.

For each objective i (1 ≤ i ≤ N) and stationary policy π, there is a corresponding state-action value function Q_i^π(s, a), which satisfies the Bellman equation (6) or (7). Let

MQ^{\pi}(s, a) = \left[ Q_1^{\pi}(s, a), Q_2^{\pi}(s, a), \ldots, Q_N^{\pi}(s, a) \right]^T

where MQ^π(s, a) is the vectored state-action value function, and it also satisfies the Bellman equation. The optimal vectored state-action value function is defined as

MQ^{*}(s, a) = \max_{\pi} MQ^{\pi}(s, a)   (8)

and the optimal policy π* can also be obtained by

\pi^{*}(s) = \arg\max_{a} MQ^{*}(s, a).   (9)

In this basic architecture, the optimization problems in (8) and (9) are both MOO problems.

B. Major Research Topics

MORL is a highly interdisciplinary field, and it refers to the integration of MOO methods and RL techniques to solve sequential decision-making problems with multiple conflicting objectives. The related disciplines of MORL include artificial intelligence, decision and optimization theory, operations research, control theory, and so on. Research topics of MORL are shaped by both MOO and RL, and mainly include the preferences among different objectives (which may vary with time), appropriate representations of preferences, the approximation of the Pareto front, and the design of efficient algorithms for specific MORL problems. Therefore, one important task of MORL is to suitably represent the designer's preferences or to ensure the optimization priority among some policies in the Pareto front.
After appropriately expressing the preferences, the remaining task is to design efficient MORL algorithms that can solve the sequential decision-making problem based on observed state transition data.

C. Naïve Solutions

Like MOO problems, MORL approaches can be divided into two groups based on the number of policies to be learned [12]: single-policy approaches and multiple-policy approaches. The aim of single-policy approaches is to obtain the best policy which simultaneously satisfies the preferences among the multiple objectives as assigned by a user or defined by the application domain. In this case, the naïve approach to solving an MORL problem is to design a synthetic objective function TQ(s, a) which can suitably represent the overall preferences. Similar to the Q-learning algorithm, a naïve single-policy solution to MORL is shown in Algorithm 3. The Q-value update rule for each objective can be expressed as

Q_i(s, a) = (1 - \alpha) Q_i(s, a) + \alpha \left( r_i + \gamma \max_{a'} Q_i(s', a') \right)

where 1 ≤ i ≤ N and α is the learning rate parameter. The overall single policy can be determined based on TQ(s, a), which can be derived using the Q-values for all the objectives.
Algorithm 3 Naïve Solution of Single-Policy Approaches to MORL
// K: the maximum number of episodes
// N: the number of objectives
1: Initialize TQ(s, a) arbitrarily;
2: repeat (for each episode j)
3:   Initialize s;
4:   repeat (for each step of the episode)
5:     Choose a from s using a policy derived from TQ(s, a);
6:     Take action a, observe r_1, r_2, ..., r_N, s';
7:     for i = 1, 2, ..., N do
8:       Q_i(s, a) ← Q_i(s, a) + α[r_i + γ max_{a'} Q_i(s', a') − Q_i(s, a)];
9:     end for
10:    Compute TQ(s, a);
11:    s ← s';
12:  until s is terminal
13: until j = K

The major difference among single-policy approaches is the way in which these preferences are expressed. By making use of the synthetic objective function, the Q-values of every objective can be utilized to fairly distribute control actions.

In order to ensure diversity in the policy space for different optimization objectives, multiple-policy approaches have been studied to obtain a set of policies that can approximate the Pareto front. The major difference among multiple-policy approaches is the manner in which the Pareto front is approximated. However, it is hard to approximate the Pareto front directly in many real-world applications. One naïve multiple-policy solution is to find policies in the Pareto front by using different synthetic objective functions. Obviously, if a set of parameters can be specified in a synthetic objective function, the optimal policy can be learned for this set of parameters. In [27], it was illustrated that by running the scalar Q-learning algorithm independently for different parameter settings, the MORL problem can be solved in a multiple-policy way.

IV. REPRESENTATIVE APPROACHES TO MORL

According to the different representations of preferences, several typical approaches to MORL have been developed. In this section, seven representative MORL approaches are reviewed and discussed. Among these seven approaches, the weighted sum approach, W-learning, the analytic hierarchy process (AHP) approach, the ranking approach, and the geometric approach are single-policy approaches. The convex hull approach and the varying parameter approach belong to the multiple-policy approaches.

A. Weighted Sum Approach

In [28], an algorithm based on the greatest mass was studied to estimate the combined Q-function. For Q-learning based on the greatest-mass strategy, the synthetic objective function is generated by summing the Q-values for all the objectives

TQ(s, a) = \sum_{i=1}^{N} Q_i(s, a).   (10)

Based on the above synthetic objective function, the action with the maximal summed value is chosen to be executed. Since Sarsa(0) is an on-policy RL algorithm (the samples used for updates are generated from the current action policy) and it does not have the problem of positive bias, GM-Sarsa(0) was proposed for MORL in [11]. The positive bias may be caused by some off-policy RL methods which only use the estimates of greedy actions for learning updates. An advantage of GM-Sarsa(0) is that, since the updates are based on the actually selected actions rather than on the best action determined by the value function, GM-Sarsa(0) is expected to have smaller errors between the estimated Q-values and the true Q-values.

A natural extension of the GM-Sarsa(0) approach is the weighted sum approach, which computes a linearly weighted sum of the Q-values for all the objectives

TQ(s, a) = \sum_{i=1}^{N} w_i Q_i(s, a).

The weights give the user the ability to put more or less emphasis on each objective.
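To make the weighted sum approach concrete, the sketch below extends the tabular Q-learning example to one Q-table per objective and selects actions greedily with respect to the synthetic objective TQ(s, a) = Σ_i w_i Q_i(s, a). The environment is assumed to return a reward vector with one component per objective; this interface and the array layout are illustrative assumptions rather than details from the papers cited above.

import numpy as np

def weighted_sum_morl(env, n_states, n_actions, weights,
                      n_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    w = np.asarray(weights, dtype=float)
    n_obj = len(w)
    Q = np.zeros((n_obj, n_states, n_actions))   # one Q-table per objective
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            TQ = w @ Q[:, s, :]                  # synthetic Q-values for state s
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(TQ))
            s_next, r_vec, done = env.step(a)    # r_vec: one reward per objective
            for i in range(n_obj):               # per-objective Q update
                Q[i, s, a] += alpha * (r_vec[i] + gamma * np.max(Q[i, s_next]) - Q[i, s, a])
            s = s_next
    return Q

Repeating such runs with different weight vectors is essentially the naïve multiple-policy strategy of Section III-C and the varying parameter approach reviewed later in this section.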
In [29], the weighted sum approach was employed to combine seven vehicle overtaking objectives, with three navigation modes used to tune the weights. In [30], a similar approach was used for the combination of three objectives, which represent the degree of crowding in an elevator, the waiting time, and the number of start-ends, respectively.

Although the weighted sum approach is very simple to implement, actions in concave regions of the Pareto front may never be chosen, so the Pareto front cannot be well approximated during the learning process [25]. For example, as shown in Fig. 4, if there are two objectives and five candidate actions, actions 2, 3, and 4 lie in the concave region of the Pareto front, while actions 1 and 5 are vertices. For any positive weights {w_i} (1 ≤ i ≤ N), the linearly weighted sums of the Q-values of actions 2, 3, and 4 are never the maximum over all candidate actions. Thus, actions 2, 3, and 4 will never be selected by greedy policies. Instead, one of the two actions 1 and 5 will be frequently chosen according to the preset weights.

Fig. 4. Concave region of the weighted sum approach.

In order to overcome this drawback, some nonlinear functions for weighting Q-values may be used for specific problem domains [31].
B. W-Learning Approach

In order to ensure that the selected action is optimal for at least one objective, several winner-take-all methods for MORL were studied in [32]. When the current state is s, a value W_i(s) is computed for each objective. The selected action is then based on the objective with the maximal W value. One simple method to compute W values is called Top-Q [11], [32], which assigns the W value as the highest Q-value in the current state

W_i(s) = \max_{a} Q_i(s, a), \quad 1 \le i \le N.

The largest W value can be obtained as

W_{\max}(s) = \max_{i} W_i(s) = \max_{i} \left\{ \max_{a} Q_i(s, a) \right\}, \quad 1 \le i \le N   (11)

and therefore the selected action is

\tilde{a} = \arg\max_{a} \left\{ \max_{i} Q_i(s, a) \right\}, \quad 1 \le i \le N.   (12)

The synthetic objective function for the Top-Q approach can be written as

TQ(s, a) = \max_{i} Q_i(s, a), \quad 1 \le i \le N.   (13)

In Top-Q, the selected action is guaranteed to be optimal for at least one objective. However, one drawback of this approach is that the objective with the highest Q-value may have similar priorities for different actions, while other objectives cannot be satisfied due to their low action values. In addition, since the Q-values depend on the scaling of the reward functions, a change in reward scaling may influence the results of the winner-take-all contest in (11)–(13). Therefore, although the Top-Q approach may obtain good performance in some cases, its behavior will be greatly influenced by the design of the reward functions [11].

In order to overcome the above drawback, W-learning was studied in [32] to compute the W values based on the following rule:

W_i(s) \leftarrow (1 - \alpha) W_i(s) + \alpha P_i(s)

P_i(s) = \max_{a} Q_i(s, a) - \left( r_i + \gamma \max_{a'} Q_i(s', a') \right)   (14)

where 1 ≤ i ≤ N and s' is the successive state after action a is executed. At each step, after selecting and executing the action with the highest W value, all the W values except the highest one (the winner or leader) are updated according to the rule in (14).

Humphrys [32] pointed out that it may not be necessary to learn the W values; instead, they can be computed directly from the Q-values in a process called negotiated W-learning, as shown in Algorithm 4. The negotiated W-learning algorithm explicitly finds how much long-term reward an objective is expected to lose if it is not allowed to determine the next action.

Algorithm 4 Negotiated W-Learning [32]
// N: the number of objectives
1: Initialize the leader l with a random integer between 1 and N;
2: Observe state s;
3: W_l = 0;
4: a_l = arg max_a Q_l(s, a);
5: loop:
6: for all objectives i except l do
7:   W_i = max_a Q_i(s, a) − Q_i(s, a_l);
8: end for
9: if max_i W_i > W_l then
10:   W_l = max_i W_i;
11:   a_l = arg max_a Q_i(s, a) (for the maximizing objective i);
12:   l = i;
13:   go to 5;
14: else
15:   terminate the loop;
16: end if
17: Return a_l;

C. AHP Approach [34]

Generally, the designer of an MORL algorithm may not have enough prior knowledge about the optimization problem. In order to express preference information, some qualitative rules are usually employed, such as "objective B is less important than objective A". Such qualitative rules specify the relative importance between two objectives but do not provide a precise mathematical description. Thus, MORL algorithms can make use of the analytic hierarchy process (AHP) to obtain a quantified description of the synthetic objective function TQ(s, a). Compared with the original AHP method in [33], the MORL method proposed in [34] can solve sequential decision-making problems with a variable number of objectives.
Based on the designer's prior knowledge of the problem, the degree of relative importance between two objectives can be quantified in L grades, and a scalar value is defined for each grade. For example, in [34], L is set to 6, and the evaluation of the importance of objective i relative to objective j is denoted by c_{ij}, where 1 ≤ i ≤ N and 1 ≤ j ≤ N. After determining the values of c_{ij}, the relative importance matrix for all objectives, C = (c_{ij})_{N×N}, can be obtained. With matrix C, the importance factor I_i (for objective i) can be calculated as [34]

I_i = \frac{SL_i}{\sum_{j=1}^{N} SL_j}

where

SL_i = \sum_{j=1, j \ne i}^{N} c_{ij}

is the importance of objective i relative to all other objectives. Then, a fuzzy inference system can be constructed for each objective. To compare two candidate actions a_p and a_q (a_p, a_q ∈ A), both the importance factor I_i and the value of improvement D_i(a_p, a_q) = Q_i(s, a_p) − Q_i(s, a_q) are used as inputs to the fuzzy system, and the output of the fuzzy system is the goodness of a_p relative to a_q.
By incorporating fuzzy subsets and fuzzy inference rules, an action selection strategy was constructed to solve the MORL problem [34]. The main drawback of this approach is that it requires a lot of prior knowledge about the problem domain.

D. Ranking Approach

The ranking approach, also called the sequential approach or the threshold approach, aims to solve multiobjective decision problems via an ordering or preference relation among multiple criteria. The idea of using ordinal relations in optimal decision making was studied in the early research of Mitten [35] and Sobel [36], where the synthetic objective function TQ(s, a) was expressed in terms of partial policies. To ensure the effectiveness of the subordinate objective (the less important objective), multiple solutions need to be obtained for the optimization problem of the main objective.

Inspired by the idea of the ranking approach, an ordering of multiple objectives was established in [37] for MORL, where threshold values were specified for some objectives in order to put constraints on those objectives. One example of this kind of situation is an unmanned vehicle that performs navigation tasks in an environment while keeping its fuel level from becoming empty. The MORL approach in [37] optimizes one objective while putting constraints on the other objectives. The actions are chosen based on the thresholds and a lexicographic ordering (the last objective is maximized first) [12], [37]. Let

CQ_i(s, a) = \min \left\{ Q_i(s, a), C_i \right\}

where 1 ≤ i ≤ N and C_i is the threshold value (the maximum allowable value) for objective i. Since objective N is assumed to be unconstrained, C_N = +∞. In the ranking approach for MORL [12], [37], given a partial ordering of all objectives and their threshold values, actions a and a' can be compared by the action comparison mechanism shown in Algorithm 5, which uses a sub-function Superior() that was recursively defined in [12]. Based on this action comparison and selection mechanism, the MORL problem can be solved by combining it with some standard RL algorithm such as Q-learning.

Algorithm 5 Action Comparison Mechanism of the Ranking Approach [12]
Superior(CQ_i(s, a), CQ_i(s, a'), i);
1: if CQ_i(s, a) > CQ_i(s, a') then
2:   Return true;
3: else if CQ_i(s, a) = CQ_i(s, a') then
4:   if i = N then
5:     Return true;
6:   else
7:     Return Superior(CQ_{i+1}(s, a), CQ_{i+1}(s, a'), i + 1);
8:   end if
9: else
10:  Return false;
11: end if

The performance of the ranking-based MORL approach mainly depends on the ordering of the objectives as well as on the threshold values. The design of an appropriate lexicographic ordering of all the objectives and their threshold values still requires some prior knowledge of the problem domain, which reflects the designer's preferences [105]. Geibel [38] employed this idea to balance the expected return and the risk, where the risk is required to be smaller than some specified threshold, and the problem was formulated as a constrained MDP. In [39], the ranking-based MORL approach was applied to the routing problem in cognitive radio networks to address the challenges of randomness, uncertainty, and multiple metrics.

E. Geometric Approach

To deal with dynamic unknown Markovian environments with long-term average reward vectors, Mannor and Shimkin [40] proposed a geometric approach to MORL. It is assumed that the actions of other agents may influence the dynamics of the environment.
Sufficient conditions for state recurrence, i.e., that the game is irreducible or ergodic, are also assumed to be satisfied. In [40], using the proposed geometric idea, two MORL algorithms, called multiple directions RL (MDRL) and single direction RL (SDRL), were presented to approximate a desired target set in a multidimensional objective space, as shown in Fig. 5. This target set can be viewed as the synthetic objective function TQ(s, a), and it satisfies some geometric conditions of the underlying dynamic game model. The MDRL and SDRL algorithms are based on reachability theory for stochastic games, and the main mechanism behind these two algorithms is to steer the average reward vector toward the target set.

Fig. 5. Predicted target set (two objectives).

When a nested class of target sets is prescribed, Mannor and Shimkin [40] also provided an extension of the geometric MORL algorithm, in which the goal of the learning agent is to approach the target set with the smallest size. A particular example of this case is solving constrained MDPs with average rewards. In addition, the geometric algorithms also need some prior knowledge of the problem domain in order to define the target set.

F. Convex Hull Approach

Barrett and Narayanan [42] presented a multiple-policy algorithm for MORL, which can simultaneously learn optimal policies for all linear preference assignments in the objective space. Two operations on convex hulls are defined as follows.

Definition 1 [42] (Translation and scaling operations):

u + bV \equiv \left\{ u + bv \mid v \in V \right\}.   (15)

Definition 2 [42] (Summing two convex hulls):

U + V \equiv hull\left\{ u + v \mid u \in U, v \in V \right\}   (16)

where u and v are vectors in the convex hulls U and V, respectively. Equation (15) indicates that the convex hull of a translated and scaled convex hull is just itself, while (16) indicates that the convex hull of two summed convex hulls requires more computation. Based on these definitions, the convex hull algorithm is illustrated in Algorithm 6, where ˆQ(s, a) contains the vertices of the convex hull of all possible Q-value vectors. This algorithm can be viewed as an extension of standard RL algorithms in which the expected rewards used to update the value function are optimal for some linear preference. Barrett and Narayanan [42] presented this algorithm and proved that it can find the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm when a special weight vector is used. This convex-hull technique can be integrated with other RL algorithms. Since multiple policies are learned at once, the integrated RL algorithms should be off-policy algorithms.

Algorithm 6 Convex Hull Algorithm [42]
1: Initialize ˆQ(s, a), for all s, a, arbitrarily;
2: while not converged do
3:   for all s ∈ S, a ∈ A do
4:     ˆQ(s, a) ← E[ r(s, a) + γ hull ∪_{a'} ˆQ(s', a') | s, a ];
5:   end for
6: end while
7: Return ˆQ;

G. Varying Parameter Approach

Generally, a multiple-policy approach can be realized by performing multiple runs of any single-policy algorithm with different parameters, objective thresholds, and orderings. For example, as indicated in [12] and [27], scalarized Q-learning can be used in a multiple-policy manner by executing repeated runs of the Q-learning algorithm with different parameters. Shelton [43] applied policy gradient methods and the idea of varying parameters to the MORL domain. In the proposed approach, after estimating a policy gradient for each objective, a weighted gradient is obtained, and multiple policies can be found by varying the weights of the objective gradients.

TABLE I. REPRESENTATIVE APPROACHES TO MORL

In summary, the representative MORL approaches mentioned above are briefly described in Table I. In addition to these approaches, some other MORL approaches have been proposed recently [44], [45]. The algorithm proposed in [44] is an extension of multiobjective fitted Q-iteration (FQI) [54] that can find control policies for all the linear combinations of preferences assigned to the objectives in a single training procedure. By performing single-objective FQI in the state-action space as well as in the weight space, the multiobjective FQI (MOFQI) algorithm realizes function approximation and generalization of the action-value function with vectored rewards. In [45], different immediate rewards were given to visited states by comparing the objective vector of the current state with those of the Pareto optimal solutions computed so far. By keeping track of the nondominated solutions and constructing the Pareto front at the end of the optimization process, the Pareto optimal solutions can be memorized in an elite list.
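Multiple-policy methods such as [45] maintain a set of nondominated solutions as an approximation of the Pareto front. A brute-force nondominance filter (for maximization) can be written in a few lines of Python; this is a generic sketch, not the specific elite-list mechanism of [45].

import numpy as np

def pareto_front(points):
    # Keep a point unless some other point is >= in every objective
    # and strictly > in at least one (Pareto dominance, maximization).
    pts = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(pts):
        dominated = any(np.all(q >= p) and np.any(q > p)
                        for j, q in enumerate(pts) if j != i)
        if not dominated:
            front.append(p)
    return np.array(front)

# With two objectives, [2, 2] is dominated by [3, 2], so
# pareto_front([[1, 4], [3, 2], [2, 2]]) keeps only [1, 4] and [3, 2].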
V. IMPORTANT DIRECTIONS OF RECENT RESEARCH ON MORL

Although the MORL approaches summarized in Section IV are very promising for further applications, there are several important directions in which MORL approaches have been improved recently.

A. Further Development of MORL Approaches

In order to obtain suitable representations of the preferences and to improve the efficiency of MORL algorithms, a lot of effort has been made recently. Estimation of distribution algorithms (EDAs) were used by Handa [46] to solve MORL problems. By incorporating notions from evolutionary MOO, the proposed method was able to acquire various strategies in a single run.
Studley and Bull [47] investigated the performance of learning classifier systems for MORL, and the results demonstrated that the choice of action-selection policy can greatly affect the performance of the learning system. Kei et al. [48] selected effective action candidates by the α-domination strategy and used a goal-directed bias based on the achievement level of each evaluation. Zhang et al. [49] and Zhao and Zhang [50] used a parallel genetic algorithm (PGA) to evolve a neuro-controller, and simultaneous perturbation stochastic approximation (SPSA) was used to improve the convergence of the proposed algorithm. The notion of adaptive margins [51] was adopted to improve the performance of MORL algorithms in [52]. To obtain Pareto optimal policies in large or continuous spaces, some recent efforts include the following: Khamis and Gomaa [53] developed a multiagent framework for MORL in traffic signal control, Castelletti et al. [54] made use of FQI for approximating the Pareto front, and Wiering and de Jong [55] proposed a new value iteration algorithm called consistent multiobjective dynamic programming.

Most of the MORL algorithms mentioned above belong to single-policy approaches, and they were studied and tested independently. The main limitation is that the small scale of previous MORL problems may not verify an algorithm's performance in dealing with a wide range of different problem settings, and the algorithm implementations always require much prior knowledge about the problem domain.

B. Dynamic Preferences in MORL

Since many real-world optimization and control problems are not stationary, there has been growing interest in solving dynamic optimization problems in recent years. Similarly, the policies obtained by MORL approaches usually rely on the preferences over the reward vectors, so it is necessary to develop MORL algorithms that take dynamic preferences into consideration. In optimal control with multiple objectives, the designer can use the fixed-weight approach [56] to determine the optimization direction. However, if there is no exact background knowledge of the problem domain, the fixed-weight approach may find unsatisfactory solutions. For MOO, the random-weight approach [57] and the adaptive approach [58] were studied to compute the weights based on the objective data so that the manual selection of objective weights can be simplified. Nevertheless, it is hard for these two approaches to express the preference of the designer.

An important problem is to automatically derive the weights when the designer is unable to assign them. This problem is usually called preference elicitation (PE), and its major related work is inverse RL [59], [60]. In order to predict the future decisions of an agent from its previous decisions, a Bayesian approach was proposed in [61] to learn the utility function. In [62], an apprenticeship learning algorithm was presented in which observed behaviors were used to learn the objective weights of the designer.

Aiming at time-varying preferences among multiple objectives, Natarajan and Tadepalli [63] proposed an MORL method that can find and keep a finite number of policies which can be appropriately selected for varying weight vectors. This algorithm is based on the average reward criterion, and its schema is shown in Algorithm 7, where δ is a tunable parameter.

Algorithm 7 Algorithm Schema for Dynamic MORL [63]
1: Obtain the current weight vector w_new; set δ as a threshold value;
2: π_init = arg max_π ( w_new · ρ^π );
3: Compute the value function vectors of π_init;
4: Compute the average reward vector of π_init;
5: Learn the new policy π through vector-based RL;
6: If ( w_new · ρ^π − w_new · ρ^{π_init} ) > δ, add π to the set of stored policies.
The motivation for this algorithm is that, despite there being infinitely many weight vectors, the set of all optimal policies may be well represented by a small number of optimal policies.

C. Evaluation of MORL Approaches

As a relatively new field of study, research on MORL has mainly focused on various principles and algorithms for dealing with multiple objectives in sequential decision-making problems. Although it is desirable to develop standard benchmark problems and methods for the evaluation of MORL algorithms, there has been little work on this topic except the recent work by Vamplew et al. [12]. In previous studies on MORL, the algorithms were usually evaluated on different learning problems, so it is necessary to define some standard test problems with certain characteristics for rigorous performance testing of MORL algorithms. For example, many MORL algorithms were tested separately on ad hoc problems, including traffic tasks [64]–[67], medical tasks [68], robot tasks [69]–[71], network routing tasks [72], grid tasks [73], and so on. As indicated in [12], it is difficult to make comparisons among these algorithms due to the lack of benchmark test problems and methodologies. Furthermore, for ad hoc application problems, the Pareto fronts are usually unknown, and it is hard to find absolute performance measures for MORL algorithms. Therefore, a suite of benchmark problems with known Pareto fronts was proposed in [12], together with a standard method for performance evaluation, which can serve as a basis for future comparative studies. In addition, two classes of MORL algorithms were evaluated in [12] based on different evaluation metrics. In particular, single-policy MORL algorithms were tested via online learning, while offline learning performance was tested for multiple-policy approaches.

VI. RELATED FIELDS OF STUDY

MORL is a highly interdisciplinary field, and its related fields mainly include MOO, hierarchical RL (HRL), and multiagent RL (MARL). Their relations to MORL and recent progress are discussed in this section. In addition, based on a keyword search in Google Scholar, the average number of MORL-related publications was about 90 per year up to 2001, while this average has become about 700 per year since 2011.
Thus, it can be observed that during the past ten years there has been a significant increase in the number of publications related to MORL.

A. MOO

MOO problems usually have no single optimal solution, which differs from SOO, where a single best solution exists. For MOO and MORL, a set of noninferior, alternative solutions, called the Pareto optimal solutions, can be defined instead of a single optimal solution. Nevertheless, the aim of MOO is to solve a parameter optimization problem with multiple objectives, while MORL is aimed at sequential decision-making problems with multiple objectives. There are two common goals for MOO algorithms: one is to find a set of solutions that is close to the Pareto optimal front, and the other is to obtain a diverse set of solutions representing the whole Pareto optimal front. These two goals have also been studied in MORL. In order to achieve them, a variety of algorithms have been presented for MOO problems, such as multiobjective particle swarm optimization (MOPSO) [74] and multiobjective genetic algorithms (MOGA) [14], [75]. These MOO algorithms have improved mathematical characteristics for solving various MOO problems. Nevertheless, when the dimension of an MOO problem is high, many of these algorithms usually show decreased performance due to the difficulty of finding a wide range of alternative solutions. Moreover, it is hard for MOGA and MOPSO to solve MOO problems with concave Pareto fronts, which are frequently encountered in real-world applications. Most MOO problems have different kinds of constraints and are then called constrained MOO problems. Until now, various constraint handling approaches [76]–[79] have been proposed to improve the performance of multiobjective evolutionary algorithms and MOPSO.

B. HRL

Hierarchical RL (HRL) makes use of a divide-and-conquer strategy to solve complex tasks with large state or decision spaces. Unlike conventional RL, HRL aims to solve sequential decision-making problems that are best described as a set of hierarchically organized tasks and sub-tasks. MORL differs from HRL in that an MORL problem requires the learning agent to solve several tasks with different objectives at once. The most outstanding advantage of HRL is that it can scale to large and complex problems [80]. One common feature of HRL and MORL is that there are multiple tasks that need to be solved by the learning agent. Earlier HRL algorithms need prior knowledge about the high-level structure of complex MDPs. There have been several HRL approaches or formulations to incorporate prior knowledge: HAMs [81], MAXQ [82], options [83], and ALisp [84]. The use of prior knowledge can simplify the problem decomposition and accelerate the learning of good policies. Currently, most HRL algorithms are based on the semi-MDP (SMDP) model. In [85], the SMDP framework was extended to concurrent activities, multiagent domains, and partially observable states. As discussed in [103], although value function approximators can be integrated with HRL, few successful results have been reported in the literature on applying existing HRL approaches to MDPs with large or continuous spaces. Furthermore, automatically decomposing the state space of MDPs or constructing options is still a difficult task. Recently, some new HRL algorithms have been proposed. In [86], an HRL approach was presented in which the state space can be partitioned by critical states. A hierarchical approximate policy iteration (HAPI) algorithm with binary-tree state space decomposition was presented in [87].
In the HAPI approach, after decomposing the original MDP into multiple sub-MDPs with smaller state spaces, better near-optimal local policies can be found, and the final global policy can be derived by combining the local policies of the sub-MDPs. For MORL problems, a hierarchical learning architecture with multiple objectives was proposed in [88]. The major idea of this approach is to make use of a reference network so that an internal reinforcement representation can be generated during the operation of the learning process. Furthermore, internal reinforcement signals from different levels can be provided to represent multilevel objectives for the learning system.

C. MARL

In MORL, the learning agent aims to solve sequential decision problems with reward vectors, and multiple policies may be obtained separately through decomposition. For multiple-policy approaches, the MORL problem can be solved in a distributed manner, which is closely related to multiagent RL (MARL) [89]. In MARL, each agent may also have its own objective, so there may be multiple objectives in the learning system. However, most of the effort in designing MARL systems has been focused on the communication and interaction (cooperation, competition, and mixed strategies) among agents, whereas in MORL there is usually no explicit sharing of information among objectives. The main task of an MARL system is for the autonomous agents to explicitly consider the other agents and coordinate their action policies so that a coherent joint behavior can be realized. In recent years, the research results on MARL can be viewed as a combination of temporal-difference learning, game theory, and direct policy search techniques.

There are several major difficulties in the research on MARL algorithms which do not arise in single-agent settings. The first difficulty is the existence of multiple equilibria. In Markov games with a single equilibrium value, the optimal policy can be well defined. But when there are multiple equilibria, MARL algorithms have to ensure that the agents coordinate their policies to select appropriate equilibria. From an empirical viewpoint, the influence of multiple equilibria on MARL algorithms should be carefully considered, since undesirable equilibria may easily be reached due to certain game properties [90]. In the design of MARL algorithms, one major aim is to make the agents' policies converge to desirable equilibria. Many heuristic exploration strategies have been proposed in the literature so that the probability of reaching the optimal equilibria can be improved in identical-interest games [89]–[92].
VII. CHALLENGES AND OPEN PROBLEMS

In addition to the three important directions for further development discussed in Section V, there are several challenges and open problems in MORL, which are discussed in the sequel.

A. State/Feature Representation in MORL

The problem of selecting an efficient state or feature representation for a real-world problem has been an important topic in RL [4], [8]. In the earlier research on RL, there was important progress in RL theory and algorithms for discrete-state MDPs. However, most real-world problems have large or continuous state spaces, and the huge computational costs make earlier tabular RL algorithms impractical for real applications. In the past decade, approximating value functions or policies with feature representations has been widely studied. One main obstacle, however, is that many RL algorithms and theories with good convergence properties usually rely on manually selected feature representations. Thus, it is difficult to accurately approximate the optimal value functions or policies without carefully selected features. Some recent advances in automatic feature representation include kernel methods and graph Laplacian approaches for RL [93], [94]. In MORL, there are multiple objectives to be achieved, and multiple value functions may need to be approximated simultaneously. The feature representation problem in MORL is more complicated due to the existence of multiple objectives or value functions. Therefore, state or feature representation in MORL is still a big challenge for further research.

B. Value Function Approximation in MORL

As discussed above, to solve MDPs with large or continuous state spaces, value function approximation (VFA) is a key technique to realize generalization and improve learning efficiency [95]. According to the basic properties of the function approximators, there are two different kinds of VFA methods, i.e., linear VFA [96], [97] and nonlinear VFA [9], [98], [99]. In some real-world applications, multilayer neural networks have commonly been employed as nonlinear approximators for VFA. However, the empirical results of successful RL applications using nonlinear VFA commonly lack rigorous theoretical analysis, and some negative results concerning divergence have been reported for Q-learning and TD learning based on direct gradient rules [4], [20]. Hence, many RL approaches with VFA require significant design effort or problem insight, and it is hard to find a basis function set that is both sufficiently simple and sufficiently reliable. To solve MDPs with large or continuous state spaces, MORL algorithms also require VFA to improve generalization ability and reduce computational costs. The additional representation of the preferences among different objectives makes developing VFA techniques even more difficult, especially when the MORL problem has dynamic preferences. Hence, VFA is a greater challenge for MORL than for standard RL.

C. Convergence Analysis of MORL Algorithms

Both the Q-learning algorithm and the Sarsa algorithm have some attractive qualities as basic approaches to RL. The major advantage is that they are guaranteed to converge to the optimal solution for a single MDP with discrete state and action spaces. Suppose there are N objectives in an MORL problem; then this problem can be considered as N sub-MDPs (a sub-MDP is an MDP with one single objective). Hence, the convergence of an algorithm for this MORL problem not only depends on the convergence of the algorithms used to solve these sub-MDPs but also on the representations of the preferences among all the objectives.
The convergence of single-policy approaches to MORL can be analyzed based on the results for the RL algorithms used to solve the sub-MDPs. However, the convergence of multiple-policy approaches also has to consider the way in which the Pareto front is approximated. In short, for stationary preferences, the convergence analysis of MORL algorithms mainly depends on the properties of the learning algorithms used to solve MDPs and on the representations of the preferences. For dynamic preferences, the dynamic characteristics must additionally be considered. So far, the convergence of MORL algorithms commonly lacks rigorous theoretical results.

D. MORL in Multiagent Systems

A multiagent system (MAS) is a system that has multiple interacting autonomous agents, and there are increasing numbers of application domains that are more suitable to be solved by a multiagent system than by a centralized single agent [88], [100]. MORL in multiagent systems is a very important research topic, due to the multiobjective nature of many practical multiagent systems. How to extend the algorithms and theories of MORL from single-agent systems to multiagent systems is an open problem. In particular, achieving several objectives at once through the cooperation of multiple agents is a difficult problem. If there is also competition in the multiagent system, the MORL problem becomes even more challenging.

E. Applications of MORL

In recent years, RL has been applied in a variety of fields [101]–[104], but research on MORL algorithms is a relatively new field of study, and as such there are few real-world applications so far. The following difficult problems should be studied before MORL can be successfully applied in real-time complex systems. One is to develop efficient MORL algorithms that can solve MDPs with large or continuous state spaces. The second is to evaluate the performance of different MORL algorithms in practice and to find the gap between theoretical analysis and real performance. The third is to consider the constraints of real-world applications when developing theoretical models for MORL, and to investigate how theoretical results can promote the development of new algorithms and mechanisms. Since in many real-world applications there are multiple conflicting objectives to be optimized simultaneously, it can be expected that MORL will find more application domains with the development of new theory and algorithms.
VIII. CONCLUSION

In this paper, the background, basic architecture, major research topics, and naïve solutions of MORL were introduced first; then several representative approaches were reviewed, and some important directions of recent research were discussed in detail. There are two main advantages of MORL. One is that MORL is very useful for improving the performance of single-objective RL by generating highly diverse Pareto-optimal models for constructing policy ensembles in domains with multiple objectives. The second is that MORL algorithms can realize a trade-off between accuracy and interpretability for sequential decision-making tasks. Among single-policy approaches, the weighted sum and W-learning approaches are very simple to implement, but they cannot express the preferences of the designer exactly. The AHP, ranking, and geometric approaches can express the preferences more exactly, but they need more prior knowledge of the problem domain. Among multiple-policy approaches, the convex hull algorithm can learn optimal policies for all linear preference assignments over the objective space at once. The varying parameter approach can easily be implemented by performing multiple runs of any single-policy algorithm with different parameters, objective threshold values, and orderings. MORL approaches have been improved recently in three important respects: enhancing their solution quality, adapting to dynamic preferences, and constructing evaluation systems. The main challenges and open problems in MORL include value function approximation, feature representation, convergence analysis of algorithms, and the application of MORL to multiagent systems and real-world problems. It can be expected that there will be more and more research progress in these directions.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the anonymous reviewers for their valuable comments and suggestions, which greatly improved the quality of this paper.

REFERENCES

[1] E. L. Thorndike, Animal Intelligence. Darien, CT, USA: Hafner.
[2] A. M. Turing, "Computing machinery and intelligence," Mind, vol. 59, Oct.
[3] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artif. Intell. Res., vol. 4, no. 1.
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press.
[5] R. S. Sutton, "Learning to predict by the method of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9–44.
[6] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3.
[7] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, May.
[8] X. Xu, D. W. Hu, and X. C. Lu, "Kernel-based least-squares policy iteration for reinforcement learning," IEEE Trans. Neural Netw., vol. 18, no. 4, Jul.
[9] R. H. Crites and A. G. Barto, "Elevator group control using multiple reinforcement learning agents," Mach. Learn., vol. 33, nos. 2–3.
[10] G. J. Tesauro, "Practical issues in temporal difference learning," Mach. Learn., vol. 8, nos. 3–4.
[11] N. Sprague and D. Ballard, "Multiple-goal reinforcement learning with modular Sarsa(0)," in Proc. 18th Int. Joint Conf. Artif. Intell., 2003.
[12] P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker, "Empirical evaluation methods for multiobjective reinforcement learning algorithms," Mach. Learn., vol. 84, nos. 1–2.
[13] G. Mitsuo and C. Runwei, Genetic Algorithms and Engineering Optimization. Beijing, China: Tsinghua Univ. Press.
Press, [14] I. Y. Kim nd O. L. de Weck, Adptive weighted sum method for multiobjective optimiztion: A new method for Preto front genertion, Struct. Multidiscipl. Optim., vol. 31, no. 2, pp , [15] A. Konk, D. W. Coitb, nd A. E. Smith, Multi-objective optimiztion using genetic lgorithms: A tutoril, Relib. Eng. Syst. Sfety, vol. 91, no. 9, pp , Sep [16] M. Yoon, Y. Yun, nd H. Nkym, Sequentil Approximte Multiobjective Optimiztion Using Computtionl Intelligence. Berlin, Germny: Springer, [17] J. G. Lin, On min-norm nd min-mx methods of multi-objective optimiztion, Mth. Progrm., vol. 103, no. 1, pp. 1 33, [18] D. M. Roijers, P. Vmplew, S. Whiteson, nd R. Dzeley, A survey of multi-objective sequentil decision-mking, J. Artif. Intell. Res., vol. 48, no. 1, pp , Oct [19] J. Si, A. Brto, W. Powell, nd D. Wunsch, Hndbook of Lerning nd Approximte Dynmic Progrmming. Hoboken, NJ, USA: Wiley-IEEE Press, [20] T. Jkkol, M. I. Jordn, nd S. P. Singh, On the convergence of stochstic itertive dynmic progrmming lgorithms, Neurl Comput., vol. 6, no. 6, pp , Nov [21] S. P. Singh, T. Jkkol, M. L. Littmn, nd C. Szepesvri, Convergence results for single-step on-policy reinforcement lerning lgorithms, Mch. Lern., vol. 38, no. 3, pp , Mr [22] A. Schwrtz, A reinforcement lerning method for mximizing undiscounted rewrds, in Proc. 10th Int. Conf. Mch. Lern., 1993, pp [23] H. L. Lio, Q. H. Wu, nd L. Jing, Multi-objective optimiztion by reinforcement lerning for power system disptch nd voltge stbility, in Proc. Innov. Smrt Grid Technol. Conf. Eur., Gothenburg, Sweden, 2010, pp [24] K. Sindhy, Hybrid evolutionry multi-objective optimiztion with enhnced convergence nd diversity, Ph.D. disserttion, Dept. Mth. Inf. Tech., Univ. Jyvskyl, Jyvskyl, Finlnd, [25] P. Vmplew, J. Yerwood, R. Dzeley, nd A. Berry, On the limittions of sclristion for multi-objective reinforcement lerning of Preto fronts, in Proc. 21st Aust. Joint Conf. Artif. Intell., vol , pp [26] E. Zitzler, L. Thiele, M. Lumnns, C. M. Fonsec, nd V. G. d Fonsec, Performnce ssessment of multiobjective optimizers: An nlysis nd review, IEEE Trns. Evol. Comput.,vol.7, no. 2, pp , Apr [27] A. Cstelletti, G. Corni, A. Rizzolli, R. Soncinie-Sess, nd E. Weber, Reinforcement lerning in the opertionl mngement of wter system, in Proc. IFAC Workshop Model. Control Environ. Issues, Yokohm, Jpn, 2002, pp [28] J. Krlsson, Lerning to solve multiple gols, Ph.D. disserttion, Dept. Comput. Sci., Univ. Rochester, Rochester, NY, USA, [29] D. C. K. Ngi nd N. H. C. Yung, A multiple gol reinforcement lerning method for complex vehicle overtking mneuvers, IEEE Trns. Intell. Trnsp. Syst., vol. 12, no. 2, pp , Jun [30] F. Zeng, Q. Zong, Z. Sun, nd L. Dou, Self-dptive multi-objective optimiztion method design bsed on gent reinforcement lerning for elevtor group control systems, in Proc. 8th World Congr. Int. Control Autom., Jinn, Chin, 2010, pp [31] G. Tesuro et l., Mnging power consumption nd performnce of computing systems using reinforcement lerning, in Advnces in Neurl Informtion Processing Systems. Cmbridge, MA, USA: MIT Press, 2007, pp [32] M. Humphrys, Action selection methods using reinforcement lerning, in From Animls to Animts 4, P. Mes, M. Mtric, J.-A. Meyer, J. Pollck, nd S. W. Wilson, Eds. Cmbridge, MA, USA: MIT Press, 1996, pp

[33] X. N. Shen, Y. Guo, Q. W. Chen, and W. L. Hu, "A multi-objective optimization genetic algorithm incorporating preference information," Inf. Control, vol. 36, no. 6.
[34] Y. Zhao, Q. W. Chen, and W. L. Hu, "Multi-objective reinforcement learning algorithm for MOSDMP in unknown environment," in Proc. 8th World Congr. Int. Control Autom., 2010.
[35] L. G. Mitten, "Composition principles for synthesis of optimum multistage processes," Oper. Res., vol. 12, Aug.
[36] M. J. Sobel, "Ordinal dynamic programming," Manage. Sci., vol. 21, May.
[37] Z. Gabor, Z. Kalmar, and C. Szepesvari, "Multi-criteria reinforcement learning," in Proc. 15th Int. Conf. Mach. Learn., 1998.
[38] P. Geibel, "Reinforcement learning with bounded risk," in Proc. 18th Int. Conf. Mach. Learn., 2001.
[39] K. Zheng, H. Li, R. C. Qiu, and S. Gong, "Multi-objective reinforcement learning based routing in cognitive radio networks: Walking in a random maze," in Proc. Int. Conf. Comput. Netw. Commun., 2012.
[40] S. Mannor and N. Shimkin, "A geometric approach to multi-criterion reinforcement learning," J. Mach. Learn. Res., vol. 5, Jan.
[41] S. Mannor and N. Shimkin, "The steering approach for multi-criteria reinforcement learning," in Proc. Adv. Neural Inf. Process. Syst., 2001.
[42] L. Barrett and S. Narayanan, "Learning all optimal policies with multiple criteria," in Proc. 25th Int. Conf. Mach. Learn., 2008.
[43] C. R. Shelton, "Balancing multiple sources of reward in reinforcement learning," in Proc. Adv. Neural Inf. Process. Syst., 2000.
[44] A. Castelletti, F. Pianosi, and M. Restelli, "Tree-based fitted Q-iteration for multi-objective Markov decision problems," in Proc. Int. Joint Conf. Neural Netw., 2012.
[45] H. L. Liu and Q. H. Wu, "Multi-objective optimization by reinforcement learning," in Proc. IEEE Congr. Evol. Comput., 2010.
[46] H. Handa, "Solving multi-objective reinforcement learning problems by EDA-RL: Acquisition of various strategies," in Proc. 9th Int. Conf. Int. Syst. Design Appl., 2009.
[47] M. Studley and L. Bull, "Using the XCS classifier system for multiobjective reinforcement learning," Artif. Life, vol. 13, no. 1.
[48] A. Kei, S. Jun, A. Takanobu, I. Kokoro, and K. Shigenobu, "Multi-criteria reinforcement learning based on goal-directed exploration and its application to a bipedal walking robot," Trans. Inst. Syst. Control Inf. Eng., vol. 18, no. 10.
[49] H. J. Zhang, J. Zhao, R. Wang, and T. Ma, "Multi-objective reinforcement learning algorithm and its application in drive system," in Proc. 34th Annu. IEEE Conf. Ind. Electron., Orlando, FL, USA, 2008.
[50] J. Zhao and H. J. Zhang, "Multi-objective reinforcement learning algorithm and its improved convergency method," in Proc. 6th IEEE Conf. Ind. Electron. Appl., Beijing, China, 2011.
[51] K. Hiraoka, M. Yoshida, and T. Mishima, "Parallel reinforcement learning for weighted multi-criteria model with adaptive margin," Cogn. Neurodyn., vol. 3, Mar.
[52] X. L. Chen, X. C. Hao, H. W. Lin, and T. Murata, "Rule driven multi objective dynamic scheduling by data envelopment analysis and reinforcement learning," in Proc. IEEE Int. Conf. Autom. Logist., Hong Kong, 2010.
[53] M. A. Khamis and W. Gomaa, "Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework," Eng. Appl. Artif. Intell., vol. 29, Mar.
[54] A. Castelletti, F. Pianosi, and M. Restelli, "Multi-objective fitted Q-iteration: Pareto frontier approximation in one single run," in Proc. IEEE Int. Conf. Netw. Sens. Control, Delft, The Netherlands, 2011.
[55] M. A. Wiering and E. D. de Jong, "Computing optimal stationary policies for multi-objective Markov decision processes," in Proc. IEEE Int. Symp. Approx. Dyn. Program. Reinf. Learn., Honolulu, HI, USA, 2007.
[56] M. Oubbati, P. Levi, and M. Schanz, "A fixed-weight RNN dynamic controller for multiple mobile robots," in Proc. 24th IASTED Int. Conf. Model. Identif. Control, 2005.
[57] C. Q. Zhang, J. J. Zhang, and X. H. Gu, "The application of hybrid genetic particle swarm optimization algorithm in the distribution network reconfigurations multi-objective optimization," in Proc. 3rd Int. Conf. Nat. Comput.
[58] D. Zheng, M. Gen, and R. Cheng, "Multiobjective optimization using genetic algorithms," Eng. Val. Cost Anal., vol. 2.
[59] A. Y. Ng and S. Russell, "Algorithms for inverse reinforcement learning," in Proc. 17th Int. Conf. Mach. Learn., 2000.
[60] C. Boutilier, "A POMDP formulation of preference elicitation problems," in Proc. 18th Nat. Conf. Artif. Intell., 2002.
[61] U. Chajewska, D. Koller, and D. Ormoneit, "Learning an agent's utility function by observing behavior," in Proc. 18th Int. Conf. Mach. Learn., 2001.
[62] P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proc. 21st Int. Conf. Mach. Learn., 2004.
[63] S. Natarajan and P. Tadepalli, "Dynamic preferences in multi-criteria reinforcement learning," in Proc. 22nd Int. Conf. Mach. Learn., 2005.
[64] M. A. Khamis and W. Gomaa, "Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework," Eng. Appl. Artif. Intell., vol. 29.
[65] Z. H. Yang and K. G. Wen, "Multi-objective optimization of freeway traffic flow via a fuzzy reinforcement learning method," in Proc. 3rd Int. Conf. Adv. Comput. Theory Eng., vol. 5, 2010.
[66] K. G. Wen, W. G. Yang, and S. R. Qu, "Efficiency and equity based freeway traffic network flow control," in Proc. 2nd Int. Conf. Comput. Autom. Eng.
[67] H. L. Duan, Z. H. Li, and Y. Zhang, "Multi-objective reinforcement learning for traffic signal control using vehicular ad hoc network," EURASIP J. Adv. Signal Process., vol. 2010, pp. 1-7, Mar. 2010.
[68] R. S. H. Istepanian, N. Y. Philip, and M. G. Martini, "Medical QoS provision based on reinforcement learning in ultrasound streaming over 3.5G wireless systems," IEEE J. Select. Areas Commun., vol. 27, no. 4, May.
[69] Y. Nojima, F. Kojima, and N. Kubota, "Local episode-based learning of multi-objective behavior coordination for a mobile robot in dynamic environments," in Proc. 12th IEEE Int. Conf. Fuzzy Syst.
[70] D. C. K. Ngai and N. H. C. Yung, "Automated vehicle overtaking based on a multiple-goal reinforcement learning framework," in Proc. IEEE Int. Conf. Control Appl., Seattle, WA, USA, 2010.
[71] T. Miyazaki, "An evaluation pattern generation scheme for electric components in hybrid electric vehicles," in Proc. 5th IEEE Int. Conf. Int. Syst., Yokohama, Japan, 2010.
[72] A. Petrowski, F. Aissanou, I. Benyahia, and S. Houcke, "Multicriteria reinforcement learning based on a Russian doll method for network routing," in Proc. 5th IEEE Int. Conf. Intell. Syst., 2010.
[73] J. Perez, C. Germain-Renaud, B. Kegl, and C. Loomis, "Multi-objective reinforcement learning for responsive grids," J. Grid Comput., vol. 8, no. 3.
[74] L. V. S. Quintero, N. R. Santiago, and C. A. C. Coello, "Towards a more efficient multi-objective particle swarm optimizer," in Multi-Objective Optimization in Computational Intelligence: Theory and Practice. Hershey, PA, USA: IGI Global, 2008.
[75] H. Li and Q. F. Zhang, "MOEA/D: A multiobjective evolutionary algorithm based on decomposition," IEEE Trans. Evol. Comput., vol. 11, no. 6, Dec.
[76] G. Tsaggouris and C. Zaroliagis, "Multiobjective optimization: Improved FPTAS for shortest paths and non-linear objectives with applications," Theory Comput. Syst., vol. 45, no. 1.
[77] J. P. Dubus, C. Gonzales, and P. Perny, "Multiobjective optimization using GAI models," in Proc. Int. Conf. Artif. Intell., 2009.
[78] P. Perny and O. Spanjaard, "Near admissible algorithms for multiobjective search," in Proc. Eur. Conf. Artif. Intell., 2008.
[79] G. G. Yen and W. F. Leong, "A multiobjective particle swarm optimizer for constrained optimization," Int. J. Swarm Intell. Res., vol. 2, no. 1, pp. 1-23.
[80] N. Dethlefs and H. Cuayahuitl, "Hierarchical reinforcement learning for adaptive text generation," in Proc. 6th Int. Conf. Nat. Lang. Gener., 2010.
[81] R. Parr and S. Russell, "Reinforcement learning with hierarchies of machines," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 1997.

[82] T. Dietterich, "Hierarchical reinforcement learning with the MAXQ value function decomposition," J. Artif. Intell. Res., vol. 13, no. 1, Aug.
[83] D. Precup and R. Sutton, "Multi-time models for temporally abstract planning," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 1998.
[84] D. Andre and S. Russell, "State abstraction for programmable reinforcement learning agents," in Proc. 18th Nat. Conf. Artif. Intell., 2002.
[85] A. G. Barto and S. Mahadevan, "Recent advances in hierarchical reinforcement learning," Discrete Event Dyn. Syst. Theory Appl., vol. 13, nos. 1-2.
[86] Z. Jin, W. Y. Liu, and J. Jin, "Partitioning the state space by critical states," in Proc. 4th Int. Conf. Bio-Inspired Comput., 2009.
[87] X. Xu, C. Liu, S. Yang, and D. Hu, "Hierarchical approximate policy iteration with binary-tree state space decomposition," IEEE Trans. Neural Netw., vol. 22, no. 12, Dec.
[88] H. B. He and B. Liu, "A hierarchical learning architecture with multiple-goal representations based on adaptive dynamic programming," in Proc. Int. Conf. Netw. Sens. Control, 2010.
[89] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multi-agent reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 38, no. 2, Mar.
[90] C. Claus and C. Boutilier, "The dynamics of reinforcement learning in cooperative multiagent systems," in Proc. 15th Nat. Conf. Artif. Intell., 1998.
[91] S. Devlin and D. Kudenko, "Theoretical considerations of potential-based reward shaping for multi-agent systems," in Proc. 10th Annu. Int. Conf. Auton. Agents Multiagent Syst., 2011.
[92] F. Leon, "Evolving equilibrium policies for a multiagent reinforcement learning problem with state attractors," in Proc. Int. Conf. Comput. Collect. Intell., Gdynia, Poland, 2011.
[93] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, May.
[94] S. Mahadevan and M. Maggioni, "Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes," J. Mach. Learn. Res., vol. 8, Jan.
[95] D. Liu, Y. Zhang, and H. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, Sep.
[96] X. Xu, H. G. He, and D. W. Hu, "Efficient reinforcement learning using recursive least-squares methods," J. Artif. Intell. Res., vol. 16, Apr.
[97] J. Boyan, "Technical update: Least-squares temporal difference learning," Mach. Learn., vol. 49, nos. 2-3.
[98] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, Mar.
[99] W. Zhang and T. Dietterich, "A reinforcement learning approach to job-shop scheduling," in Proc. 14th Int. Joint Conf. Artif. Intell., 1995.
[100] J. Wu et al., "A novel multi-agent reinforcement learning approach for job scheduling in grid computing," Future Gen. Comput. Syst., vol. 27, no. 5.
[101] H. S. Ahn et al., "An optimal satellite antenna profile using reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 41, no. 3, May.
[102] F. Bernardo, R. Agustí, J. Pérez-Romero, and O. Sallent, "Intercell interference management in OFDMA networks: A decentralized approach based on reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 41, no. 6, Nov.
[103] S. Adam, L. Busoniu, and R. Babuska, "Experience replay for real-time reinforcement learning control," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 2, Mar.
[104] F. Hernandez-del-Olmo, E. Gaudioso, and A. Nevado, "Autonomous adaptive and active tuning up of the dissolved oxygen setpoint in a wastewater treatment plant using reinforcement learning," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 42, no. 5, Sep.
[105] R. Issabekov and P. Vamplew, "An empirical comparison of two common multiobjective reinforcement learning algorithms," in Proc. 25th Int. Australas. Joint Conf., Sydney, NSW, Australia, 2012.

Chunming Liu received the B.Sc. and M.Sc. degrees from the National University of Defense Technology, Changsha, China, in 2004 and 2006, respectively, where he is currently pursuing the Ph.D. degree. His current research interests include intelligent systems, machine learning, and autonomous land vehicles.

Xin Xu (M'07-SM'12) received the B.S. degree in electrical engineering from the National University of Defense Technology (NUDT), Changsha, China, in 1996, and the Ph.D. degree in control engineering from the College of Mechatronics and Automation, NUDT. He is currently a Full Professor with the Institute of Unmanned Systems, College of Mechatronics and Automation, NUDT. He has been a Visiting Scientist for cooperation research with Hong Kong Polytechnic University, Hong Kong; the University of Alberta, Edmonton, AB, Canada; the University of Guelph, Guelph, ON, Canada; and the University of Strathclyde, Glasgow, U.K. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles. He has authored or coauthored over 100 papers in international journals and conferences and has coauthored four books. He currently serves as an Associate Editor of Information Sciences and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. Dr. Xu was the recipient of the 2nd Class National Natural Science Award of China in 2012 and the Fok Ying Tong Youth Teacher Fund of China. He is a Committee Member of the IEEE Technical Committee on Approximate Dynamic Programming and Reinforcement Learning and the IEEE Technical Committee on Robot Learning. He was a PC Member or Session Chair of various international conferences.

Dewen Hu (SM'09) was born in Hunan, China. He received the B.Sc. and M.Sc. degrees from Xi'an Jiaotong University, Xi'an, China, in 1983 and 1986, respectively, and the Ph.D. degree from the National University of Defense Technology, Changsha, China. From 1986, he was with the National University of Defense Technology. From 1995 to 1996, he was a Visiting Scholar with the University of Sheffield, Sheffield, U.K., and he was later promoted to Professor. His current research interests include image processing, system identification and control, neural networks, and cognitive science. Dr. Hu was the recipient of the 2nd Class National Natural Science Award of China. He is an Action Editor of Neural Networks.
