Reinforcement Learning Based Dynamic Selection of Auxiliary Objectives with Preserving of the Best Found Solution


Irina Petrova, Arina Buzdalova
ITMO University
49 Kronverkskiy av., Saint-Petersburg, Russia, 197101
Email: irenepetrova@yandex.com, abuzdalova@gmail.com

arXiv:1704.07187v1 [cs.ne] 24 Apr 2017

Abstract—The efficiency of single-objective optimization can be improved by introducing auxiliary objectives. Ideally, auxiliary objectives should be helpful. However, in practice, objectives may be efficient at some stages of optimization but obstructive at others. In this paper we propose a modification of the EA+RL method, which dynamically selects the objective to be optimized using reinforcement learning. The proposed modification prevents losing the best found solution. We analyse the proposed modification and compare it with the EA+RL method and Random Local Search on the XdivK, Generalized OneMax and LeadingOnes problems. The proposed modification outperforms the EA+RL method on all problem instances. It also outperforms the single-objective approach on most problem instances. We also provide a detailed analysis of how different components of the considered algorithms influence the efficiency of optimization. In addition, we present a theoretical analysis of the proposed modification on the XdivK problem.

I. INTRODUCTION

Consider single-objective optimization of a target objective by an evolutionary algorithm (EA). Commonly, the efficiency of an EA is measured in one of two ways. In the first one, efficiency is defined as the number of fitness function evaluations needed to reach the optimum. In the second one, efficiency is computed as the maximum target objective value obtained within a fixed number of evaluations. In this work we use the first way.

The efficiency of target objective optimization can be increased by introducing auxiliary objectives [1]–[5]. Ideally, auxiliary objectives should be helpful [2]. However, in practice, objectives can be generated automatically and may be efficient at some stages of optimization but obstructive at others [6], [7]. We call such objectives non-stationary. One of the approaches to deal with such objectives is dynamic selection of the best objective at the current stage of optimization. The objectives may be selected randomly [8]. A better method is EA+RL, which uses reinforcement learning (RL) [9], [10]. It was theoretically shown for a number of optimization problems that EA+RL works efficiently with stationary objectives [11], [12]. However, theoretical analysis of EA+RL with non-stationary objectives showed that EA+RL does not ignore an obstructive objective on the XdivK problem [13]. Selection of an inefficient auxiliary objective causes the loss of the best found solution, and the algorithm needs many steps to find a good solution again. EA+RL can also get stuck in local optima while solving the Generalized OneMax problem with obstructive objectives [14].

In this paper we propose a modified version of EA+RL and analyse it theoretically and experimentally on the XdivK, Generalized OneMax and LeadingOnes problems. The rest of the paper is organized as follows. First, the EA+RL method and the model problems with non-stationary objectives are described. Second, a modification of the EA+RL method is proposed. Then we experimentally analyse the considered methods. Finally, we provide a discussion and a theoretical explanation of the achieved results.

II. PRELIMINARIES

In this section we describe the EA+RL method. We also define the model problems and the non-stationary objectives used in this study.

A. EA+RL method

In reinforcement learning (RL), an agent applies an action to an environment. The environment then returns a numerical reward and a representation of its state, and the process repeats.
The goal of the agent is to maximize the total reward [9]. In the EA+RL method, the EA is treated as an environment, and selection of an objective to be optimized corresponds to an action. The agent selects an objective, the EA generates a new population using this objective and returns some reward to the agent. The reward depends on the difference of the best target objective values in two subsequent generations. We consider maximization problems, so the higher the newly obtained target value, the higher the reward.

In recent theoretical analysis of EA+RL with non-stationary objectives, random local search (RLS) is used instead of an EA, and the population consists of a single individual [13]. Individuals are represented as bit strings, and flip-one-bit mutation is used. If the values of the selected objective computed on the new individual and on the current one are equal, the new individual is accepted. The used reinforcement learning algorithm is Q-learning. Therefore, the EA+RL algorithm in this case is called RLS + Q-learning. The pseudocode of RLS + Q-learning is presented in Algorithm 1. The current state s is defined as the target objective value of the current individual. The reward is calculated as the difference of the target objective values in two subsequent generations. In Q-learning, the efficiency of selecting an objective h in a state s is measured by the value Q(s, h), which is updated dynamically after each selection as shown in line 13 of the pseudocode, where α and γ are the learning rate and the discount factor correspondingly.

Algorithm 1 RLS + Q-learning
1: Individual y ← a random bit string
2: Construct the set H of auxiliary objectives and the target objective
3: Q(s, h) ← 0 for each state s and objective h ∈ H
4: while (optimum of the target objective t is not found) do
5:   Current state s ← t(y)
6:   Individual y′ ← mutate y (flip a random bit)
7:   Objective h: Q(s, h) = max_{h′ ∈ H} Q(s, h′) {if Q-values are equal, objectives are selected equiprobably}
8:   if h(y′) ≥ h(y) then
9:     y ← y′
10:  end if
11:  New state s′ ← t(y)
12:  Reward r ← s′ − s
13:  Q(s, h) ← Q(s, h) + α(r + γ · max_{h′ ∈ H} Q(s′, h′) − Q(s, h))
14: end while
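For illustration, the loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in the paper; the `optimum` argument and the function names are ours, and ties between equal Q-values are broken uniformly at random, as in line 7 of the pseudocode.

```python
import random
from collections import defaultdict

def rls_q_learning(n, target, auxiliaries, optimum, alpha=0.5, gamma=0.5):
    """RLS + Q-learning: the agent chooses which objective guides acceptance."""
    y = [random.randint(0, 1) for _ in range(n)]      # random bit string
    H = list(auxiliaries) + [target]                   # auxiliary objectives and the target
    Q = defaultdict(float)                             # Q(s, h), initialized to 0
    evaluations = 0
    while target(y) < optimum:                         # until the optimum of t is found
        s = target(y)                                  # current state: target objective value
        y_new = y[:]
        y_new[random.randrange(n)] ^= 1                # flip-one-bit mutation
        best = max(Q[(s, g)] for g in H)
        h = random.choice([g for g in H if Q[(s, g)] == best])
        if h(y_new) >= h(y):                           # accept if not worse w.r.t. the selected objective
            y = y_new
        evaluations += 1
        s_new = target(y)
        r = s_new - s                                  # reward: change of the target objective value
        Q[(s, h)] += alpha * (r + gamma * max(Q[(s_new, g)] for g in H) - Q[(s, h)])
    return evaluations
```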

B. Model problems

In this paper we consider three model problems which were used in previous studies of EA+RL [12]–[14]. In all considered problems, an individual is a bit string of length n. Let x be the number of bits in an individual which are set to one. Then the objective ONEMAX is equal to x and the objective ZEROMAX is equal to n − x.

One of the considered problems is GENERALIZED ONEMAX, denoted as OM_n^d. The target objective of this problem, also called OM_n^d, is calculated as the number of bits in an individual of length n that match a given bit mask. The bit mask has d 0-bits and n − d 1-bits.

Another problem is XDIVK. The target objective is calculated as ⌊x/k⌋, where x is the number of ones, k is a constant, k < n and k divides n.

The last considered problem is LEADINGONES. The target objective of this problem is equal to the length of the maximal prefix consisting of bits set to one.

C. Non-stationary objectives

We used the two following non-stationary auxiliary objectives for all the considered problems. These auxiliary objectives can be either ONEMAX or ZEROMAX at different stages of optimization. More precisely, consider the auxiliary objectives h_1 and h_2 defined in (1). The parameter p is called a switch point, because at this point the auxiliary objectives change their properties. ONEMAX and ZEROMAX are denoted as OM and ZM correspondingly.

h_1(x) = \begin{cases} OM, & x \le p \\ ZM, & p < x \end{cases} \qquad h_2(x) = \begin{cases} ZM, & x \le p \\ OM, & p < x \end{cases}   (1)

In the XDIVK problem, an objective which is currently equal to ONEMAX allows to distinguish individuals with the same value of the target objective and gives preference to the individual with a higher x value. Such an individual is more likely to produce a descendant with a higher target objective value. In the LEADINGONES problem, ONEMAX is helpful because it has the same optimum, but the running time of optimizing ONEMAX is lower [15]. Therefore, in the LEADINGONES and XDIVK problems, the objective which is equal to ONEMAX at the current stage of optimization is helpful. In both these problems the goal is to obtain the all-ones individual, so ZEROMAX is obstructive. In the OM_n^d problem, both objectives may be obstructive or neutral. For example, if for some i the i-th bit of the OM_n^d bit mask is set to 1, and the mutation operator flips the i-th bit of an individual from 0 to 1, then the ZEROMAX objective is obstructive, because it would not accept this individual. In the inverse case, ONEMAX is obstructive.
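For concreteness, the model problems and the non-stationary objectives (1) can be expressed as simple fitness functions over bit lists; the following Python helpers are an illustrative sketch (the factory-function names are ours):

```python
def onemax(y):                 # OneMax: number of 1-bits
    return sum(y)

def zeromax(y):                # ZeroMax: number of 0-bits
    return len(y) - sum(y)

def leading_ones(y):           # LeadingOnes: length of the maximal all-ones prefix
    count = 0
    for bit in y:
        if bit != 1:
            break
        count += 1
    return count

def make_xdivk(k):             # XdivK target: floor(x / k), where x is the number of 1-bits
    return lambda y: sum(y) // k

def make_generalized_onemax(mask):   # OM_n^d target: number of positions matching the mask
    return lambda y: sum(1 for a, b in zip(y, mask) if a == b)

def make_switching(p, before, after):
    # Non-stationary objective as in (1): behaves as `before` while x <= p
    # and as `after` once x > p, where p is the switch point.
    def h(y):
        return before(y) if sum(y) <= p else after(y)
    return h

# h1 = make_switching(p, onemax, zeromax); h2 = make_switching(p, zeromax, onemax)
```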
III. MODIFIED EA+RL

In this section we propose a modification of the EA+RL method which prevents EA+RL from losing the best found solution.

In the EA+RL method, if the newly generated individual is better than the existing one according to the selected objective, the new individual is accepted. However, if the selected objective is obstructive, the new individual may be worse than the existing individual in terms of the target objective. In this case the EA loses the individual with the best target objective value. In the modified EA+RL, if the newly generated individual is better than the existing one according to the selected objective but worse according to the target objective, the new individual is rejected. As in the recent theoretical works, we use RLS as the optimization problem solver and apply Q-learning to select objectives. The pseudocode of the modified RLS + Q-learning is presented in Algorithm 2.

To motivate the approach to reward calculation in the new method, we need to describe how the agent learns which objective should be selected in the EA+RL method. If the agent selects an obstructive objective and the EA loses the individual with the best target value, the best target value in the new generation decreases. So the agent receives a negative reward for this objective and will not select it in the same state later. However, when the properties of the auxiliary objectives change, an obstructive objective may become helpful. If the properties change within one RL state, the objective which became helpful will not be selected, because the agent previously received a negative reward for it. Conversely, the objective which became obstructive could be selected because it was helpful earlier. And, as was shown in [13], EA+RL needs many steps to get out of this trap.

Algorithm 2 Modified RLS + Q-learning
1: Individual y ← a random bit string
2: Construct the set H of auxiliary objectives and the target objective
3: Q(s, h) ← 0 for each state s and objective h ∈ H
4: while (optimum of the target objective t is not found) do
5:   Calculate current state s
6:   Save target fitness: f ← t(y)
7:   Individual y′ ← mutate y (flip a random bit)
8:   Objective h: Q(s, h) = max_{h′ ∈ H} Q(s, h′) {if Q-values are equal, objectives are selected equiprobably}
9:   if h(y′) ≥ h(y) and t(y′) ≥ t(y) then
10:    y ← y′
11:  end if
12:  Calculate new state s′
13:  Calculate reward r
14:  Q(s, h) ← Q(s, h) + α(r + γ · max_{h′ ∈ H} Q(s′, h′) − Q(s, h))
15: end while

For this reason, in the new method we consider two versions of the modified EA+RL with different ways of reward calculation. In the first version of the modified EA+RL, if the new individual is better than the current one according to the selected objective, but its target objective value is lower, the new individual is rejected but the agent receives a negative reward as if the new individual had been accepted. So the agent learns as in EA+RL. We call this algorithm the modification of EA+RL with learning on mistakes. In this case, in line 13 of Algorithm 2 the reward is calculated as presented in Algorithm 3. In the second version, in the same situation the agent receives zero reward, because the individual in the population has not changed. Thereby, the agent does not learn if the action was inefficient and learns only if the target objective value was increased. We call this algorithm the modification of EA+RL without learning on mistakes. In this case, in line 13 of Algorithm 2 the reward is calculated as t(y) − f.

Algorithm 3 Reward calculation in the modified EA+RL with learning on mistakes (Algorithm 2, line 13)
1: r ← t(y) − f
2: if h(y′) ≥ h(y) then
3:   r ← t(y′) − f
4: end if

In the existing theoretical works on EA+RL, the RL state is defined as the target objective value [11], [14]. We denote it as the target state. However, if the individual with the best target objective value is preserved, the algorithm never returns to a state where it received a positive reward. So the agent never learns which objective is helpful; it can only learn that an objective is obstructive if it received a negative reward for it. Therefore, in the present work we consider two definitions of a state. The first one is the target state. The second one is the single state, which is the same during the whole optimization process. This state is used to investigate the efficiency of the proposed method when the agent has learned which objective is good.
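Combining Algorithm 2 with the two reward schemes and the two state definitions, the modified method can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation; the ε-greedy selection and the single-state option anticipate the experimental setup of Section V, and the `optimum` argument is ours.

```python
import random
from collections import defaultdict

def modified_rls_q_learning(n, target, auxiliaries, optimum,
                            learn_on_mistakes=True, single_state=False,
                            alpha=0.5, gamma=0.5, epsilon=0.1):
    """Modified RLS + Q-learning (Algorithm 2) with both reward schemes (Algorithm 3)."""
    y = [random.randint(0, 1) for _ in range(n)]
    H = list(auxiliaries) + [target]
    Q = defaultdict(float)
    evaluations = 0
    while target(y) < optimum:
        s = 0 if single_state else target(y)           # single state vs. target state
        f = target(y)                                  # save the current target fitness
        y_new = y[:]
        y_new[random.randrange(n)] ^= 1                # flip-one-bit mutation
        if random.random() < epsilon:                  # epsilon-greedy objective selection
            h = random.choice(H)
        else:
            best = max(Q[(s, g)] for g in H)
            h = random.choice([g for g in H if Q[(s, g)] == best])
        better_by_h = h(y_new) >= h(y)
        if better_by_h and target(y_new) >= target(y):
            y = y_new                                  # accept only if the target value does not decrease
        evaluations += 1
        s_new = 0 if single_state else target(y)
        if learn_on_mistakes and better_by_h:
            r = target(y_new) - f                      # Algorithm 3: penalize as if y_new had been accepted
        else:
            r = target(y) - f                          # zero unless the target value actually increased
        Q[(s, h)] += alpha * (r + gamma * max(Q[(s_new, g)] for g in H) - Q[(s, h)])
    return evaluations
```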
IV. THEORETICAL ANALYSIS

Previously, it was shown that the EA+RL method gets stuck in local optima on XdivK with non-stationary objectives [13]. Below we present a theoretical analysis of the running time of the proposed EA+RL modification without learning on mistakes on this problem. The target state is used. To compute the expected running time of the algorithm, we construct the Markov chain that represents the corresponding optimization process [11], [13]. Recall that RL states are determined by the target objective value, while Markov states correspond to the number of 1-bits in an individual. Therefore, an RL state includes k Markov states with different numbers of 1-bits. The Markov chain for the XDIVK problem is shown in Fig. 1. The labels on the transitions have the format F, M, where F is a fitness function that can be chosen for this transition and M is the corresponding effect of mutation. A transition probability is computed as the sum over all f ∈ F of the product of the probability of selecting f and the probability of the corresponding mutation M. Further we describe each Markov state and explain why the transitions have such labels.

Fig. 1. Markov chain for the EA+RL modification without learning on mistakes on XDIVK (Markov states dk−1, dk, dk+1, ..., dk+k; transitions are labelled with the selected objective and the flipped bit).

Consider the number of ones equal to dk, where d is a constant. So the agent is in the RL state d and the Markov state is dk.

Since the agent has no experience in the state d, the objectives are selected equiprobably. If the agent selects the target objective or an objective which is currently equal to ONEMAX, and the mutation operator inverts a 1-bit, the new individual has dk − 1 1-bits and is worse than the current individual according to the selected objective. So the new individual will not be accepted. The same situation occurs if the selected objective is equal to ZEROMAX and the mutation operator inverts a 0-bit. In the case of selection of the ZEROMAX objective and inversion of a 1-bit, the new individual is better than the current one according to the selected objective. However, the target objective value of the new individual is less than the target objective value of the current individual. Therefore, the new individual is not selected for the next generation. If ONEMAX or the target objective is selected and a 0-bit is inverted, the new individual is accepted.

The transitions in the Markov states dk and dk + 1 differ from each other when a 1-bit is inverted and the agent selects the target objective or the objective which is equal to ZEROMAX. In this case, the new individual is equal to or better than the current one according to the selected objective, so the new individual is accepted. However, the new individual contains fewer 1-bits than the current one, so the algorithm moves to the Markov state dk. Transitions in the states dk+2, ..., dk+k−1 are the same as transitions in the state dk+1. From the Markov chain of the algorithm we can see that the transitions and, as a consequence, the performance of the algorithm do not depend on the number and positions of switch points.

To analyse the running time of the considered algorithm, we also need to construct the Markov chain for RLS without auxiliary objectives (see Fig. 2). This Markov chain is constructed analogously to the Markov chain described above.

Fig. 2. Markov chain for RLS without auxiliary objectives on XDIVK.

The expected running time of the EA+RL modification without learning on mistakes for XDIVK with non-stationary objectives is equal to the number of fitness function evaluations needed to get from the Markov state 0 to the Markov state n. Each transition in the Markov chain corresponds to one fitness function evaluation of the mutated individual. So the expected running time is equal to the number of transitions in the Markov chain. Denote the expected running time of the algorithm by T(n):

T(n) = \sum_{i=0}^{n-1} E(i \to i+1),   (2)

where E(i → i+1) is the expected number of transitions needed to reach the Markov state i+1 from the state i.

Consider two cases for the state i. The first is i = dk, where d is a constant. The expected number of transitions needed to reach the state dk+1 from the state dk, z_{dk} = E(dk → dk+1), is evaluated as:

z_{dk} = \frac{2(n - dk)}{3n} \cdot 1 + \left(\frac{2\,dk}{3n} + \frac{1}{3}\right)(1 + z_{dk}).   (3)

From (3) we obtain that z_{dk} is evaluated as:

z_{dk} = \frac{3n}{2(n - dk)}.   (4)

The second case is i = dk + t, where 1 ≤ t ≤ k − 1. The expected number of transitions needed to reach the state dk+t+1 from the state dk+t, z_{dk+t} = E(dk+t → dk+t+1), is evaluated as:

z_{dk+t} = \frac{2(n - dk - t)}{3n} + \frac{2(dk + t)}{3n}\,(1 + z_{dk+t-1} + z_{dk+t}) + \left(\frac{dk + t}{3n} + \frac{n - dk - t}{3n}\right)(1 + z_{dk+t}).   (5)

From (5) we obtain that z_{dk+t} is evaluated as:

z_{dk+t} = z_{dk+t-1}\,\frac{dk + t}{n - dk - t} + \frac{3n}{2(n - dk - t)}.   (6)
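The recurrences can also be evaluated numerically. The following Python sketch computes the expected running time T(n) from (2), (4) and (6), starting from the Markov state 0 as in the analysis above; the analogous recurrences for RLS without auxiliary objectives, derived in (7)–(10) below, are included for comparison. The function names are ours.

```python
def expected_runtime_modified(n, k):
    """T(n) for the modification without learning on mistakes, via (2), (4) and (6)."""
    total = 0.0
    for d in range(n // k):                  # each RL state d covers Markov states dk, ..., dk+k-1
        z_prev = 3 * n / (2 * (n - d * k))   # z_{dk}, equation (4)
        total += z_prev
        for t in range(1, k):
            i = d * k + t
            z_prev = z_prev * i / (n - i) + 3 * n / (2 * (n - i))   # equation (6)
            total += z_prev
    return total

def expected_runtime_rls(n, k):
    """Expected running time of plain RLS via the analogous recurrences (8) and (10)."""
    total = 0.0
    for d in range(n // k):
        a_prev = n / (n - d * k)             # a_{dk}, equation (8)
        total += a_prev
        for t in range(1, k):
            i = d * k + t
            a_prev = a_prev * i / (n - i) + n / (n - i)             # equation (10)
            total += a_prev
    return total

# For example, expected_runtime_modified(48, 2) / expected_runtime_rls(48, 2)
# is approximately 1.5, in line with the comparison derived below.
```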
To estimate the efficiency of the analysed EA+RL modification, let us calculate the expected running time of RLS without auxiliary objectives. The evaluation approach is similar to the one presented above for the EA+RL modification. The total running time is calculated by (2). Analogously, we consider two cases: i = dk and i = dk + t.

The expected number of transitions needed to reach the state dk+1 from the state dk, a_{dk} = E(dk → dk+1), is evaluated as:

a_{dk} = \frac{n - dk}{n} \cdot 1 + \frac{dk}{n}\,(1 + a_{dk}).   (7)

From (7) we obtain that a_{dk} is evaluated as:

a_{dk} = \frac{n}{n - dk}.   (8)

The expected number of transitions needed to reach the state dk+t+1 from the state dk+t, a_{dk+t} = E(dk+t → dk+t+1), is evaluated as:

a_{dk+t} = \frac{n - dk - t}{n} \cdot 1 + \frac{dk + t}{n}\,(1 + a_{dk+t-1} + a_{dk+t}).   (9)

From (9) we obtain that a_{dk+t} is evaluated as:

a_{dk+t} = a_{dk+t-1}\,\frac{dk + t}{n - dk - t} + \frac{n}{n - dk - t}.   (10)

From (4) and (8) we obtain that z_{dk} = (3/2) a_{dk}. From equations (2), (6) and (10), using mathematical induction, we obtain that the running time of the EA+RL modification without learning on mistakes for XDIVK with non-stationary objectives is 1.5 times greater than the running time of RLS without auxiliary objectives. Therefore, the EA+RL modification without learning on mistakes has asymptotically the same running time as RLS, which is bounded by Ω(n^k) and O(n^{k+1}) [11]. This shows that the EA+RL modification without learning on mistakes can deal with non-stationary auxiliary objectives, unlike EA+RL.

V. EXPERIMENTAL ANALYSIS OF MODIFIED EA+RL

In this section we experimentally evaluate the efficiency of both proposed EA+RL modifications on the optimization problems defined in Section II-B.

A. Description of experiments

We analysed the two versions of the proposed modification of EA+RL and the EA+RL method on OM_n^d, XDIVK and LEADINGONES. The non-stationary objectives described in (1) were used for all the problems. For XDIVK we analysed two cases of the switch point position. The first case is the worst case [13], when the switch point is at the end of optimization, p = n − k + 1. In the second case, the switch point is in the middle of optimization, p = n/2. For each algorithm we analysed two state definitions: the single state and the target state. We also studied application of the ε-greedy strategy. In this strategy the agent selects the objective with the maximum expected reward with probability 1 − ε, and with probability ε the agent selects a random objective. This strategy gives the agent an opportunity to select an objective which was inefficient but became efficient after the switch point. The obtained numbers of fitness function evaluations needed to reach the optimum were averaged over 1000 runs. We used the Q-learning algorithm with the same parameters as in [13]: the learning rate was set to α = 0.5 and the discount factor to γ = 0.5. The ε-greedy strategy was used with ε = 0.1.

B. Discussion of experiment results

Table I presents the results of the experiments. The first column contains parameter values for the considered problems. The next column contains the results of RLS without auxiliary objectives. The next three columns correspond to the results of the EA+RL modification with learning on mistakes (modified EA+RL, learning). The following three columns correspond to the results of the EA+RL modification without learning on mistakes (modified EA+RL, no learning). The last three columns correspond to the results of the EA+RL method (EA+RL). Each algorithm was analysed with the single state (ss, ε = 0 and ss, ε = 0.1) and the target state (ts, ε = 0 and ts, ε = 0.1). None of the algorithms reached the optimum using the single state and ε = 0, so these results are not presented. When the optimum was not reached within 10^9 iterations, the corresponding result is marked as inf.

We can see from Table I that the modification of EA+RL with learning on mistakes using the single state and ε = 0.1 is the most efficient algorithm on LEADINGONES, OM_n^d and XDIVK with the switch point in the middle. On LeadingOnes and XdivK with the switch point in the middle, this algorithm ignores the inefficient objective and selects the efficient one, so the achieved results are better than the results of RLS without objectives. On OM_n^d this modification ignores obstructive objectives and achieves the same results as RLS.
On XdivK with the switch point at the end, the best results are achieved using the modification of EA+RL without learning on mistakes.

For each problem, we picked the best configuration of each algorithm and compared them by the Mann-Whitney test with Bonferroni correction. The algorithms were statistically distinguishable with p-value less than 0.05. Below we analyse how different components of the considered methods influence optimization performance. More precisely, we consider the influence of best individual preservation, of learning which objective is obstructive, of the state definition and of the ε-greedy strategy.

1) Influence of learning which objective is obstructive: We can see from the results that the modification of EA+RL without learning is outperformed by the version with learning on LEADINGONES and on XDIVK with the switch point in the middle. Therefore, learning on mistakes is useful because it allows the agent to remember that an objective is obstructive and not to select it further. However, on the XdivK problem with the switch point at the end, the best results are achieved using the modification of EA+RL without learning on mistakes. Below we explain why learning on mistakes is not always efficient.

If some objective becomes helpful, the agent will not select this objective in the same state because it received a negative reward for it. So there are two ways for the agent to learn that an obstructive objective became helpful. The first way is to select this objective with probability ε and receive a positive reward. The second way is to move to a new state, where the agent has not yet learned which objective is efficient. This is impossible when using the single state. Consider what actions the agent should take to move to a new state if the target state is used. The only way to move to a new state is to increase the target objective value. The agent could select the target objective or the objective which was efficient but became obstructive. So the target objective value can be increased only if the new individual has a higher target objective value and the selected objective is the target one. In some optimization problems it is not always possible to increase the target objective value in one iteration of the algorithm.

TABLE I. AVERAGE NUMBER OF FITNESS FUNCTION EVALUATIONS NEEDED TO REACH THE GLOBAL OPTIMUM

Column groups: RLS; modified EA+RL, learning; modified EA+RL, no learning; EA+RL (each EA+RL group with configurations ss, ε = 0.1; ts, ε = 0; ts, ε = 0.1).

Parameters | RLS | ss, ε=0.1 | ts, ε=0 | ts, ε=0.1 | ss, ε=0.1 | ts, ε=0 | ts, ε=0.1 | ss, ε=0.1 | ts, ε=0 | ts, ε=0.1

LeadingOnes (parameter: n)
1 | 1.46 | 1.65 | 1.72 | 1.76 | 1.77 | 1.72 | 1.77 | inf | inf | inf
11 | 6.20·10^1 | 7.00·10^1 | 6.63·10^1 | 7.05·10^1 | 6.97·10^1 | 9.07·10^1 | 9.20·10^1 | inf | inf | inf
21 | 2.21·10^2 | 2.09·10^2 | 2.06·10^2 | 2.15·10^2 | 2.54·10^2 | 3.27·10^2 | 3.31·10^2 | inf | inf | inf
31 | 4.76·10^2 | 3.97·10^2 | 4.22·10^2 | 4.43·10^2 | 5.67·10^2 | 7.24·10^2 | 7.23·10^2 | inf | inf | inf
41 | 8.45·10^2 | 6.71·10^2 | 7.04·10^2 | 7.42·10^2 | 1.02·10^3 | 1.26·10^3 | 1.25·10^3 | inf | inf | inf
51 | 1.29·10^3 | 9.39·10^2 | 1.06·10^3 | 1.13·10^3 | 1.59·10^3 | 1.96·10^3 | 1.94·10^3 | inf | inf | inf
61 | 1.86·10^3 | 1.27·10^3 | 1.48·10^3 | 1.58·10^3 | 2.36·10^3 | 2.78·10^3 | 2.77·10^3 | inf | inf | inf
71 | 2.53·10^3 | 1.57·10^3 | 1.96·10^3 | 2.10·10^3 | 3.23·10^3 | 3.79·10^3 | 3.79·10^3 | inf | inf | inf
81 | 3.28·10^3 | 1.92·10^3 | 2.52·10^3 | 2.65·10^3 | 4.28·10^3 | 4.94·10^3 | 4.87·10^3 | inf | inf | inf
91 | 4.15·10^3 | 2.35·10^3 | 3.16·10^3 | 3.35·10^3 | 5.51·10^3 | 6.16·10^3 | 6.14·10^3 | inf | inf | inf
101 | 5.07·10^3 | 2.78·10^3 | 3.81·10^3 | 4.04·10^3 | 6.85·10^3 | 7.63·10^3 | 7.65·10^3 | inf | inf | inf
111 | 6.11·10^3 | 3.11·10^3 | 4.60·10^3 | 4.92·10^3 | 8.38·10^3 | 9.31·10^3 | 9.19·10^3 | inf | inf | inf
121 | 7.33·10^3 | 3.68·10^3 | 5.43·10^3 | 5.79·10^3 | 1.00·10^4 | 1.10·10^4 | 1.10·10^4 | inf | inf | inf
131 | 8.60·10^3 | 4.17·10^3 | 6.36·10^3 | 6.73·10^3 | 1.17·10^4 | 1.29·10^4 | 1.28·10^4 | inf | inf | inf
141 | 1.00·10^4 | 4.61·10^3 | 7.20·10^3 | 7.80·10^3 | 1.36·10^4 | 1.49·10^4 | 1.49·10^4 | inf | inf | inf
151 | 1.13·10^4 | 5.08·10^3 | 8.33·10^3 | 8.90·10^3 | 1.57·10^4 | 1.72·10^4 | 1.72·10^4 | inf | inf | inf
161 | 1.30·10^4 | 5.44·10^3 | 9.39·10^3 | 1.01·10^4 | 1.81·10^4 | 1.94·10^4 | 1.96·10^4 | inf | inf | inf
171 | 1.45·10^4 | 6.04·10^3 | 1.06·10^4 | 1.13·10^4 | 2.05·10^4 | 2.18·10^4 | 2.19·10^4 | inf | inf | inf
181 | 1.65·10^4 | 6.60·10^3 | 1.18·10^4 | 1.27·10^4 | 2.29·10^4 | 2.47·10^4 | 2.46·10^4 | inf | inf | inf
191 | 1.81·10^4 | 7.28·10^3 | 1.33·10^4 | 1.41·10^4 | 2.58·10^4 | 2.73·10^4 | 2.73·10^4 | inf | inf | inf

OM_n^d (parameters: n, d)
100, 50 | 4.51·10^2 | 4.93·10^2 | 5.65·10^2 | 5.69·10^2 | 6.49·10^2 | 6.75·10^2 | 6.81·10^2 | inf | inf | inf
200, 100 | 1.04·10^3 | 1.09·10^3 | 1.26·10^3 | 1.31·10^3 | 1.47·10^3 | 1.55·10^3 | 1.57·10^3 | inf | inf | inf
300, 150 | 1.72·10^3 | 1.74·10^3 | 2.03·10^3 | 2.05·10^3 | 2.40·10^3 | 2.51·10^3 | 2.51·10^3 | inf | inf | inf
400, 200 | 2.43·10^3 | 2.43·10^3 | 2.80·10^3 | 2.90·10^3 | 3.42·10^3 | 3.56·10^3 | 3.53·10^3 | inf | inf | inf
500, 250 | 3.12·10^3 | 3.16·10^3 | 3.65·10^3 | 3.72·10^3 | 4.34·10^3 | 4.58·10^3 | 4.60·10^3 | inf | inf | inf

XdivK, switch point at the end (parameters: n, k)
40, 2 | 1.19·10^3 | 3.20·10^3 | 3.67·10^3 | 3.25·10^3 | 2.52·10^3 | 1.73·10^3 | 1.77·10^3 | inf | 3.72·10^3 | 2.28·10^4
48, 2 | 1.70·10^3 | 4.46·10^3 | 5.05·10^3 | 4.31·10^3 | 3.73·10^3 | 2.37·10^3 | 2.49·10^3 | inf | 5.23·10^3 | 4.86·10^4
56, 2 | 2.30·10^3 | 5.95·10^3 | 6.69·10^3 | 6.45·10^3 | 4.92·10^3 | 3.50·10^3 | 3.46·10^3 | inf | 6.81·10^3 | 1.59·10^5
64, 2 | 2.98·10^3 | 7.86·10^3 | 8.48·10^3 | 7.78·10^3 | 6.31·10^3 | 4.47·10^3 | 4.55·10^3 | inf | 8.91·10^3 | 3.75·10^5
72, 2 | 3.76·10^3 | 9.42·10^3 | 1.12·10^4 | 1.03·10^4 | 8.05·10^3 | 5.60·10^3 | 5.47·10^3 | inf | 1.11·10^4 | 1.38·10^6
80, 2 | 4.62·10^3 | 1.25·10^4 | 1.33·10^4 | 1.30·10^4 | 9.38·10^3 | 6.74·10^3 | 7.07·10^3 | inf | 1.41·10^4 | 3.56·10^6
60, 3 | 3.94·10^4 | 2.44·10^5 | 2.79·10^5 | 2.53·10^5 | 7.10·10^4 | 5.82·10^4 | 5.95·10^4 | inf | 2.95·10^5 | 3.16·10^7
72, 3 | 6.79·10^4 | 4.18·10^5 | 4.93·10^5 | 4.19·10^5 | 1.15·10^5 | 1.03·10^5 | 1.03·10^5 | inf | 4.98·10^5 | 3.30·10^8
84, 3 | 1.08·10^5 | 6.57·10^5 | 7.69·10^5 | 6.55·10^5 | 1.72·10^5 | 1.61·10^5 | 1.64·10^5 | inf | 7.81·10^5 | inf
96, 3 | 1.60·10^5 | 9.91·10^5 | 1.21·10^6 | 9.97·10^5 | 2.59·10^5 | 2.39·10^5 | 2.44·10^5 | inf | 1.05·10^6 | inf
108, 3 | 2.28·10^5 | 1.34·10^6 | 1.75·10^6 | 1.45·10^6 | 3.49·10^5 | 3.34·10^5 | 3.34·10^5 | inf | 1.63·10^6 | inf
120, 3 | 3.12·10^5 | 1.93·10^6 | 2.32·10^6 | 1.97·10^6 | 4.87·10^5 | 4.70·10^5 | 4.76·10^5 | inf | 2.37·10^6 | inf

XdivK, switch point in the middle (parameters: n, k)
40, 2 | 1.19·10^3 | 9.06·10^2 | 6.63·10^2 | 7.13·10^2 | 2.46·10^3 | 1.73·10^3 | 1.76·10^3 | inf | 8.28·10^2 | 3.57·10^3
48, 2 | 1.70·10^3 | 1.22·10^3 | 9.76·10^2 | 1.01·10^3 | 3.41·10^3 | 2.37·10^3 | 2.61·10^3 | inf | 1.18·10^3 | 9.59·10^3
56, 2 | 2.30·10^3 | 1.50·10^3 | 1.28·10^3 | 1.35·10^3 | 4.76·10^3 | 3.50·10^3 | 3.45·10^3 | inf | 1.56·10^3 | 2.43·10^4
64, 2 | 2.98·10^3 | 1.84·10^3 | 1.66·10^3 | 1.77·10^3 | 6.09·10^3 | 4.47·10^3 | 4.23·10^3 | inf | 2.04·10^3 | 7.08·10^4
72, 2 | 3.76·10^3 | 2.13·10^3 | 2.03·10^3 | 2.23·10^3 | 7.61·10^3 | 5.60·10^3 | 5.41·10^3 | inf | 2.55·10^3 | 1.91·10^5
80, 2 | 4.62·10^3 | 2.48·10^3 | 2.61·10^3 | 2.75·10^3 | 9.66·10^3 | 6.74·10^3 | 6.56·10^3 | inf | 3.02·10^3 | 4.76·10^5
60, 3 | 3.94·10^4 | 6.61·10^3 | 1.06·10^4 | 1.20·10^4 | 7.02·10^4 | 5.82·10^4 | 5.76·10^4 | inf | 1.17·10^4 | 1.52·10^6
72, 3 | 6.79·10^4 | 1.12·10^4 | 1.78·10^4 | 2.11·10^4 | 1.19·10^5 | 1.03·10^5 | 1.01·10^5 | inf | 2.00·10^4 | 1.43·10^7
84, 3 | 1.08·10^5 | 1.79·10^4 | 3.02·10^4 | 3.35·10^4 | 1.73·10^5 | 1.61·10^5 | 1.64·10^5 | inf | 3.08·10^4 | 1.57·10^8
96, 3 | 1.60·10^5 | 3.01·10^4 | 4.13·10^4 | 5.15·10^4 | 2.60·10^5 | 2.39·10^5 | 2.44·10^5 | inf | 4.75·10^4 | inf
108, 3 | 2.28·10^5 | 4.54·10^4 | 5.86·10^4 | 6.71·10^4 | 3.74·10^5 | 3.34·10^5 | 3.43·10^5 | inf | 6.88·10^4 | inf
120, 3 | 3.12·10^5 | 6.46·10^4 | 8.25·10^4 | 9.37·10^4 | 4.98·10^5 | 4.70·10^5 | 4.59·10^5 | inf | 9.32·10^4 | inf

For example, consider the XDIVK problem. Let the number of 1-bits be dk, so the RL state is d. To move to the state d+1, the algorithm needs to mutate k 0-bits. Let the switch point p be equal to dk + l, where 0 < l < k. Then, if the number of 1-bits is greater than dk + l, the algorithm can increase the number of 1-bits only if a 0-bit is mutated and the target objective is selected. However, whatever bit is mutated and whatever objective is selected, the target objective value stays unchanged until an individual with dk + k 1-bits is obtained. So the agent does not recognize whether its action is good or bad, because the reward is equal to zero. Therefore, to increase the target objective value the algorithm needs a lot of steps. In addition, if the switch point is at the end of optimization, the probability to mutate a 0-bit and select the target objective at the same time is low. The worst case is when the switch point is equal to n − k + 1, because the agent needs to select the target objective and increase the number of 1-bits during k − 1 iterations.

The modification of EA+RL without learning on mistakes does not have this drawback because, as was shown in the theoretical analysis, it does not depend on the number of switch points and their positions. To conclude, the modification of EA+RL with learning performs better than the proposed modification without learning, except in the case when the switch point occurs at a time when it is hard to generate good solutions (stagnation).

2) Influence of best individual preservation: Thanks to preserving the individual with the best target value, the proposed modification of EA+RL achieves the optimum on the LEADINGONES and OM_n^d problems, unlike the EA+RL method. However, we can see that on XdivK with the switch point in the middle, EA+RL achieves better results despite the possibility of losing the best individual. This is explained by the fact that EA+RL learns which objective is obstructive, and for this problem it is more important than preserving the best individual. The inverse situation is observed for XdivK with the switch point at the end, where learning does not help to achieve the optimum faster even in the case of preserving the best individual (see Section V-B1). To conclude, preservation of the best individual improves the EA+RL method. However, learning which objective is efficient has a greater impact on the algorithm performance.

3) Influence of state definition: To begin with, let us separately consider RL efficiency when using the target state and the single state. Then we discuss how the state definition influences the performance of the two proposed modifications of EA+RL.

Consider the target state. The algorithm moves to a new state if the target objective value is increased and, consequently, a positive reward is obtained. If the target objective value cannot be decreased, the algorithm never returns to a state where it achieved a positive reward. Therefore, the agent never learns which action is efficient if the target state is used. The single state, contrary to the target state, allows to learn which objective is helpful. However, when the objective which was efficient becomes obstructive, the agent continues selecting this objective. Re-learning that the objective which was helpful became obstructive is more difficult when the single state is used than when the target state is used (see Section V-B1). None of the algorithms using the single state with ε = 0 reached the optimum, so without the ε-greedy strategy the agent could not re-learn. Also, EA+RL without preservation of the best obtained individual does not achieve the optimum using the single state.

Consider the influence of the state definition in the EA+RL modification with learning. When solving LEADINGONES, the positive effect of the ability to learn which objective is efficient is more important than the negative effect of difficult re-learning. We can see different influences of these negative and positive effects when solving the XdivK problem with the switch point in the middle. When k becomes bigger, it is more important to select an efficient objective during the k iterations in which the agent does not receive a reward and cannot determine which objective is better. So we can see that for k = 3 the single state is much better than the target state. On the XdivK problem with the switch point at the end, the results obtained using the single state and the target state with ε = 0.1 are the same. These results are better than the results obtained using the target state with ε = 0. Therefore, the impact of ε is greater than the impact of the state definition. In the EA+RL modification without learning, the agent does not receive negative rewards, so re-learning when using the single state is harder. Re-learning can occur only if a positive reward is obtained when a better solution is generated.
In XdivK, many iterations are needed to increase the target objective value (see Section V-B1). Therefore, the results on XdivK for the target state are better than the results for the single state. In LEADINGONES, generating an individual with a higher target objective value is not so difficult. Therefore, using the single state the algorithm obtains better results than using the target state.

To conclude, in the modification of EA+RL with learning on mistakes the single state is better in most cases. In the modification of EA+RL without learning, the single state is worse than the target state only on XdivK, where re-learning is difficult. We can also note that EA+RL does not achieve the optimum using the single state.

4) Influence of ε-greedy strategy: Consider the influence of the ε value on the performance of the considered algorithms when using the target state. In this state the agent can only learn that an objective is obstructive if a negative reward is received after applying this objective. So this objective will not be selected in the same state even if it becomes helpful. A non-zero ε allows to select this objective. In the modification of EA+RL without learning on mistakes the agent can obtain only non-negative rewards, so the situation described above is impossible in this modification. Therefore, the ε value does not influence the efficiency of this modification. In the modification of EA+RL with learning on mistakes, a non-zero ε is helpful only on the XdivK problem with the switch point at the end. On the other problems the results obtained with non-zero ε are worse than the results obtained with ε = 0.

When using the single state, ε allows to select the objective which was obstructive, as if the target state was used. However, if the single state is used, the agent learns not only that an objective is inefficient, but also that an objective is efficient. Therefore, ε also allows the agent not to select the objective which was efficient but became obstructive. This results in reaching the optimum when using non-zero ε in the single state. To conclude, if the single state is used, the ε value has to be non-zero. If the target state is used, non-zero ε allows to achieve better results only if re-learning is very difficult, such as in XdivK with the switch point at the end.

VI. CONCLUSION

We proposed a modification of the EA+RL method which preserves the best found solution. We considered two versions of the proposed modification. In the first version, called the modification of EA+RL without learning on mistakes, the RL agent learns only when the algorithm finds a better solution.

In the second version, called the modification of EA+RL with learning on mistakes, the RL agent also learns when the algorithm obtains an inefficient solution. We considered two auxiliary objectives which change their properties at a switch point. We experimentally analysed the two proposed modifications and the EA+RL method on OM_n^d, LEADINGONES, XDIVK with the switch point in the middle of optimization and XDIVK with the switch point at the end. Two RL state definitions were considered: the single state and the target state. We also considered how the ε-greedy exploration strategy influences the performance of the algorithm.

Both proposed modifications reached the optimum on OM_n^d and LEADINGONES, unlike the EA+RL method. The modification of EA+RL with learning using the single state and ε = 0.1 achieved the best results among the considered objective selection methods on all problems, except the XDIVK problem with the switch point at the end. On LEADINGONES and XDIVK with the switch point in the middle, this algorithm was able to select an efficient objective and obtain better results than RLS. On OM_n^d, without helpful objectives, this modification ignored obstructive objectives and achieved the same results as RLS. Therefore, keeping the best individual and using ε-greedy exploration in the single state seems to be the most promising reinforcement based objective selection approach.

We theoretically proved that the lower and upper bounds on the running time of the modification of EA+RL without learning on mistakes on the XDIVK problem are Ω(n^k) and O(n^{k+1}) correspondingly. The asymptotic of RLS on XDIVK without auxiliary objectives is the same. This means that the modification of EA+RL without learning on mistakes ignores the objective which is currently obstructive on the XDIVK problem. We also proved that the performance of this modification is independent of the number of switch points and their positions, while the performance of the modification with learning depends on these factors. In particular, the modification without learning achieves the best results on XdivK with the switch point at the end. This is an especially difficult case, because increasing the target objective value at the end of optimization needs a lot of iterations.

VII. ACKNOWLEDGMENTS

This work was supported by RFBR according to the research project No. 16-31-00380 mol_a.

REFERENCES

[1] C. Segura, C. A. C. Coello, G. Miranda, and C. León, "Using multi-objective evolutionary algorithms for single-objective optimization," 4OR, vol. 3, no. 11, pp. 201-228, 2013.
[2] J. D. Knowles, R. A. Watson, and D. Corne, "Reducing local optima in single-objective problems by multi-objectivization," in Proceedings of the First International Conference on Evolutionary Multi-Criterion Optimization. Springer-Verlag, 2001, pp. 269-283.
[3] F. Neumann and I. Wegener, "Can single-objective optimization profit from multiobjective optimization?" in Multiobjective Problem Solving from Nature, ser. Natural Computing Series. Springer Berlin Heidelberg, 2008, pp. 115-130.
[4] D. Brockhoff, T. Friedrich, N. Hebbinghaus, C. Klein, F. Neumann, and E. Zitzler, "On the effects of adding objectives to plateau functions," IEEE Transactions on Evolutionary Computation, vol. 13, no. 3, pp. 591-603, 2009.
[5] J. Handl, S. C. Lovell, and J. D. Knowles, "Multiobjectivization by decomposition of scalar cost functions," in Parallel Problem Solving from Nature PPSN X, ser. Lecture Notes in Computer Science. Springer, 2008, no. 5199, pp. 31-40.
[6] M. Buzdalov and A. Buzdalova, "Adaptive selection of helper-objectives for test case generation," in 2013 IEEE Congress on Evolutionary Computation, vol. 1, 2013, pp. 2245-2250.
[7] D. F. Lochtefeld and F. W. Ciarallo, "Helper-objective optimization strategies for the Job-Shop scheduling problem," Applied Soft Computing, vol. 11, no. 6, pp. 4161-4174, 2011.
[8] M. T. Jensen, "Helper-objectives: Using multi-objective evolutionary algorithms for single-objective optimisation: Evolutionary computation combinatorial optimization," Journal of Mathematical Modelling and Algorithms, vol. 3, no. 4, pp. 323-347, 2004.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[10] A. Buzdalova and M. Buzdalov, "Increasing efficiency of evolutionary algorithms by choosing between auxiliary fitness functions with reinforcement learning," in Proceedings of the International Conference on Machine Learning and Applications, vol. 1, 2012, pp. 150-155.
[11] M. Buzdalov and A. Buzdalova, "OneMax helps optimizing XdivK: Theoretical runtime analysis for RLS and EA+RL," in Proceedings of Genetic and Evolutionary Computation Conference Companion. ACM, 2014, pp. 201-202.
[12] M. Buzdalov and A. Buzdalova, "Can OneMax help optimizing LeadingOnes using the EA+RL method?" in Proceedings of IEEE Congress on Evolutionary Computation, 2015, pp. 1762-1768.
[13] I. Petrova, A. Buzdalova, and G. Korneev, "Runtime analysis of random local search with reinforcement based selection of non-stationary auxiliary objectives: initial study," in Proceedings of the 22nd International Conference on Soft Computing MENDEL 2016, Czech Republic, 2016, pp. 95-102.
[14] A. Buzdalova, I. Petrova, and M. Buzdalov, "Runtime analysis of different approaches to select conflicting auxiliary objectives in the generalized OneMax problem," in Proceedings of IEEE Symposium Series on Computational Intelligence, 2016, pp. 280-286.
[15] A. Auger and B. Doerr, Theory of Randomized Search Heuristics: Foundations and Recent Developments. River Edge, NJ, USA: World Scientific Publishing Co., Inc., 2011.