Optimistic Planning with Long Sequences of Identical Actions: An Extended Theoretical and Experimental Study


ACTA ELECTROTEHNICA, Volume 56, Number 1-2. © 2015 Mediamira Science Publisher. All rights reserved.

Koppány Máthé, Lucian Bușoniu, Liviu Miclea
Department of Automation, Technical University of Cluj-Napoca

Abstract - Optimistic planning for deterministic systems (OPD) finds near-optimal control solutions for general, nonlinear systems. OPD iteratively explores a search tree of action sequences by always expanding further the most promising sequence, where each expansion appends all possible one-step actions. However, the generality of the algorithm comes at a high computational cost. We aim to alleviate this complexity in a subclass of control problems where longer ranges of constant actions are preferred, by adapting OPD to this class of problems. The novel algorithm is called optimistic planning with K identical actions (OKP), and it creates sequences by appending to them up to K repetitions of each possible action. In our analysis we show that, indeed, OKP offers a similar a posteriori performance as OPD and that in certain cases the tree depth reached (a measure of the performance) is increased compared to OPD. Our experiments, performed on the inverted pendulum and HIV infection treatment control, confirm that for suitable control problems OKP can perform better than OPD, for properly tuned parameter K.

Keywords: online planning; piecewise constant control; predictive control; optimization

1. INTRODUCTION

We consider optimal control problems modelled as Markov decision processes, in which a nonlinear system must be controlled in discrete time so as to maximize a discounted sum of rewards along an infinite horizon (the return). Examples of such performance indices include processing times, energy or resource usage, etc. Many algorithms have been proposed to solve this type of problem, coming from a variety of fields including reinforcement learning [1], approximate dynamic programming [2], optimal control, stochastic programming, etc.

In this paper, the rewards are assumed to be bounded, and the control actions (inputs) must reside in a finite set with M elements. We focus on online planning algorithms, which work by searching at each step for an action sequence that is best locally for the current state of the system, and then applying the first action of this sequence. Specifically, we employ and extend the algorithm Optimistic Planning for Deterministic Systems (OPD) [3], which iteratively expands a tree representation of the possible sequences of actions. At each iteration, OPD further refines a sequence that has the largest upper bound on the discounted return, hence the adjective 'optimistic'. Refinement is done by appending all possible single-step actions to the end of the sequence, resulting in M child nodes (sequences). Then, at the end, a safe choice is made by returning the sequence with the largest lower bound.

In certain optimal control problems, sequences that keep the same action constant for a larger number of steps are preferred. This is e.g. a feature of bang-bang solutions to time-optimal control problems (e.g. in vehicle path planning, aerospace) [4]. Switched systems where the actions are mode selections are another example, since switches between actions are often energetically costly (e.g., raising or lowering a canal barrier [5]), or switching too often may destabilize the system [6]. In networked control systems, changing the action implies sending it over the network and is therefore costly in terms of communication.

Our aim in this paper is to adapt optimistic planning to such problems, where repeated actions are preferable. The main idea is simple: rather than appending single actions to the optimistic sequence, we append up to K repetitions of the same action, where K is the main parameter of the algorithm. Thus, if such repeated-action solutions are better, the algorithm will have the opportunity to discover them earlier than OPD. The other elements of the algorithm remain the same as in OPD (namely, an optimistic sequence is refined, and a sequence maximizing the lower bound is returned at the end). We call the modified algorithm Optimistic Planning with K identical actions, OKP.

We provide analytical insight into the behavior of the algorithm. We start by showing that the near-optimality of the solution is still dictated by the deepest expanded sequence, like for OPD (the deeper, the closer to the optimum). However, this depth will be different from that in OPD due to the different expansion rule. In particular, we analyze two interesting special cases illustrating that in certain problems the depth will be larger and using OKP pays off, but that in other problems OPD will expand deeper trees and so it remains preferable. Experimental results are given for two optimal control problems: swinging up an underactuated inverted pendulum, and treatment of HIV infection by optimal switching of drugs on and off. In our simulations, OKP leads to shallower trees than OPD, so we cannot take advantage of the analytical results. Nevertheless, in practice OKP outperforms OPD for the control problems investigated, and we expect that for properly tuned parameter K this will hold true in a wider class of problems.

Several other types of optimistic planning algorithms have been proposed [7], such as for stochastic problems [8], [9] and continuous action spaces [10], [11]. Closer to the present work is our approach from [12], where planning only explores sequences that switch between actions a limited number of times. In contrast, the approach presented here does not limit the number of switches, instead only exploring repeated-action solutions earlier. In the control field, related approaches include piecewise-constant control [13] and specifically piecewise-constant model-predictive control [14], [15], as well as similar ideas in sampled-data systems [16]. References [17]-[19] focus on improving performance and reducing computation, as we do. However, whereas stability is the main concern in these works, our theoretical discussion explicitly focuses on the relation between computation and performance of the algorithm. Unlike [19], our algorithm does not constrain where the action should be changed (although it works better when switches happen after around K steps). Note that online planning in general is a type of model-predictive control.

This article is a revised and extended version of the conference paper [20].
Additional contributions with respect to the conference version include the analysis of two illustrative cases to provide more insight, and a proof of the main result; the new HIV infection example; and a more detailed experimental investigation in the inverted pendulum problem.

The rest of the paper is structured as follows. Section II describes the optimal control problem and OPD. Section III presents OKP with its analysis, and Section IV shows our simulation results. Section V gives our conclusions.

2. OPTIMAL CONTROL AND OPTIMISTIC PLANNING FOR DETERMINISTIC SYSTEMS

Optimal control problems are often defined as Markov decision processes (MDPs) using states $x \in X$, actions $u \in U$, a state transition function $f : X \times U \to X$, $f(x, u) = x'$, and a corresponding reward function $\rho : X \times U \to \mathbb{R}$, $\rho(x, u) = r$. Besides describing the system dynamics by the function $f$ (i.e. the transition from state $x$ to $x'$ when applying control action $u$), MDPs also provide a characterization of the quality of each transition, by means of $\rho(x, u)$. Discounted optimal control finds, for any given initial state $x_0$, an infinite action sequence $h_\infty = (u_0, u_1, \dots)$ that maximizes the discounted sum of rewards:

$$v(h_\infty) = \sum_{k=0}^{\infty} \gamma^k \rho(x_k, u_k) \qquad (1)$$

where $\gamma \in [0, 1)$ is the discount factor. Function $v$ is called the value function, and the optimal value is denoted $v^* = \sup_{h_\infty} v(h_\infty)$.

An optimal control method, optimistic planning for deterministic systems (OPD) [3], formulates control problems as MDPs, assuming a finite and discrete action space $U = \{u^1, \dots, u^M\}$, the system dynamics $f$ and the reward function $\rho$ to be known, and the rewards to be bounded, $\rho(x, u) \in [0, 1]$, $\forall x, u$. With these assumptions, OPD looks for the optimizer of a problem by constructing search trees of action sequences. Given a computational budget $n$ of allowed number of expansions, OPD starts from the empty sequence and iteratively appends in each expansion step all possible actions from $U$ to an existing sequence. In this manner, a tree is constructed where the nodes correspond to the control actions taken. Each expanded node will have M children.
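As a small illustration of the discounted sum in (1), the sketch below accumulates the rewards of a finite action sequence on a toy deterministic system. The toy dynamics `f`, the reward `rho`, and the helper name `discounted_return` are assumptions made for this example only; they are not the systems studied in this paper.

```python
from typing import Callable, Sequence, Tuple


def discounted_return(x0: int, actions: Sequence[int],
                      f: Callable[[int, int], int],
                      rho: Callable[[int, int], float],
                      gamma: float) -> Tuple[float, int]:
    """Accumulate sum_k gamma^k * rho(x_k, u_k) along a finite action sequence."""
    total, x, discount = 0.0, x0, 1.0
    for u in actions:
        total += discount * rho(x, u)  # reward of the transition taken from x_k
        x = f(x, u)                    # deterministic transition to x_{k+1}
        discount *= gamma
    return total, x


# Hypothetical toy problem: an integer position on the segment 0..5.
def f(x: int, u: int) -> int:
    return min(5, max(0, x + u))


def rho(x: int, u: int) -> float:
    return x / 5.0  # bounded in [0, 1], as the bounded-reward assumption requires


value, x_final = discounted_return(0, [1, 1, 1], f, rho, gamma=0.5)
```

Because the rewards are nonnegative, the sum over a finite prefix can only underestimate the value of any infinite sequence that starts with that prefix.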
Each leaf determines an action sequence, obtained by reading all actions from the root to that leaf. An example OPD tree is shown in Fig. 1.

Fig. 1. OPD search tree after three expansions, with action space size M=2. Nodes contain actions, and each expanded node has M children, one corresponding to each action from the action space. A sample action sequence, leading to the bold leaf at tree depth $d = 3$, is $h_3 = (u_0, u_1, u_2)$.

The tree expansion is performed based on the following rule: for each leaf (i.e. simulated, finitely long action sequence $h_d$), an upper bound $b(h_d)$ is calculated on the value function of all the infinite action sequences $h_\infty$ that pass through that leaf:

$$b(h_d) = l(h_d) + \frac{\gamma^d}{1 - \gamma} \ge v(h_\infty) \qquad (2)$$

where $l(h_d)$ defines a lower bound on $v(h_\infty)$:

$$l(h_d) = \sum_{k=0}^{d-1} \gamma^k \rho(x_k, u_k) \le v(h_\infty) \qquad (3)$$

Because $\rho(x_k, u_k)$ takes values between 0 and 1, these are valid bounds on $v(h_\infty)$ from (1). Having these bounds calculated, the leaf with the highest upper bound (also called b-value) is selected for further expansion, after which the procedure is repeated. The algorithm is called optimistic as it always makes the selection for expansion by assuming the best possible value for the rewards that have not yet been observed. At the end of the search, after the $n$ expansions have been exhausted, the leaf with the highest lower bound is chosen as the near-optimal solution:

$$h' = \arg\max_{h_d} l(h_d) \qquad (4)$$

Working in a receding horizon fashion, OPD usually applies to the system only the first action from $h'$, after which the algorithm is repeated for the new state the system reached.

Proposition 1. OPD expands only nodes that satisfy $v^* - l(h_d) \le \gamma^d / (1 - \gamma)$. Further, OPD is $\gamma^{d^*} / (1 - \gamma)$-optimal, with $d^*$ the depth of the deepest expanded node.

In other words, the cumulative reward $v(h')$ of the chosen action sequence is at most $\gamma^{d^*} / (1 - \gamma)$ smaller than the optimal solution. Note that this information is available only a posteriori, i.e. after running the algorithm. A formulation of this a posteriori guarantee with a detailed proof, as well as an a priori bound, are provided in [3].

3. OPTIMISTIC PLANNING WITH K IDENTICAL ACTIONS

Optimistic planning with K identical actions (OKP) considers the subclass of control problems where ranges of identical control actions are preferred, like in the case of networked control systems or switched systems. In networked control systems, bandwidth limitations favour fewer transmissions of control action updates, and thus fewer action switches in the action sequences. Switched systems commonly have high costs for changing the current mode of the system (e.g. opening a barrier, or altering the state of a mechanical switch) and thus prefer longer ranges of constant actions. For such control problems, OKP modifies the principle of OPD as follows: during the optimistic search tree construction, OKP adds with each expansion step $M \cdot K$ children to a node, taking each possible action with recurrence 1 to K. Taking K=1 reduces OKP to OPD.

Fig. 2. OKP search tree with M=3 possible actions and K=2 maximum repetitions. The left graph is a compact representation of the tree, with $u^{m,k}$ indicating the sequence of action $u^m$ repeated $k$ times ($1 \le k \le K$). The same tree is unwrapped in the OPD representation in the right graph. During our analysis, we will use both representations.

The novelty of the algorithm is the additional evaluation of sequences of repeated actions. In case of control problems where ranges of identical control actions are preferred, intuitively, OKP should reach a near-optimal solution with fewer expansions than OPD.

We denote sequences from the compact tree as $h'_{d'}$, while corresponding unwrapped sequences are denoted $h_d$. Note also the change of notation for the depths. An example sequence from the compact tree at depth $d' = 2$ is $h'_2 = [u^{3,1}, u^{1,2}]$, which corresponds to the unwrapped action sequence $h_3 = [u^3, u^1, u^1]$ with $d = 3$. Note that the unwrapped tree depths modify the b-value definition as follows:

$$b(h_d) = l(h_d) + \frac{\gamma^d}{1 - \gamma} \ge v(h_\infty) \qquad (5)$$

with the lower bound defined as:

$$l(h_d) = \sum_{k=0}^{d-1} \gamma^k \rho(x_k, u_k) \le v(h_\infty) \qquad (6)$$

These relations hold since $v(h_\infty)$ is calculated for the unwrapped sequences in the same manner as in the case of OPD.
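The expansion rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `okp_plan`, the use of a binary heap, the duplicate handling via a set of unwrapped sequences, and the toy system at the bottom are all choices made for this example. Ties between equal b-values go to the node created first.

```python
import heapq
from itertools import count


def okp_plan(x0, actions, f, rho, gamma, K, n_expansions):
    """Sketch of optimistic planning with up to K repetitions per expansion.

    Returns (sequence, lower_bound) of the created node with the largest
    lower bound. With K=1 each expansion adds one child per action.
    """
    order = count()  # creation index; ties in b-value go to the oldest leaf
    # Heap entries: (-b_value, creation_index, lower_bound, depth, state, sequence)
    leaves = [(-1.0 / (1.0 - gamma), next(order), 0.0, 0, x0, ())]
    seen = {()}  # unwrapped sequences already present in the tree
    best_seq, best_l = (), 0.0
    for _ in range(n_expansions):
        if not leaves:
            break
        _, _, l, d, x, seq = heapq.heappop(leaves)  # leaf with the largest b-value
        for u in actions:
            l_c, x_c, seq_c = l, x, seq
            for k in range(1, K + 1):  # child = action u repeated k times
                l_c += gamma ** (d + k - 1) * rho(x_c, u)
                x_c = f(x_c, u)
                seq_c = seq_c + (u,)
                if seq_c in seen:  # assumed way of skipping duplicates
                    continue
                seen.add(seq_c)
                b = l_c + gamma ** (d + k) / (1.0 - gamma)
                heapq.heappush(leaves, (-b, next(order), l_c, d + k, x_c, seq_c))
                if l_c > best_l:
                    best_seq, best_l = seq_c, l_c
    return best_seq, best_l


# Hypothetical toy system: only action 1 is rewarded.
def f(x, u):
    return x + u


def rho(x, u):
    return 1.0 if u == 1 else 0.0


seq_okp, l_okp = okp_plan(0, (-1, 0, 1), f, rho, gamma=0.5, K=2, n_expansions=3)
seq_opd, l_opd = okp_plan(0, (-1, 0, 1), f, rho, gamma=0.5, K=1, n_expansions=3)
```

In this toy run the K=2 variant returns a longer sequence of the rewarded action than the K=1 variant for the same number of expansions, although each of its expansions simulates more transitions.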

The OKP tree contains duplicate unwrapped sequences. For instance, sequences $h'_1 = [u^{3,2}]$ and $h'_2 = [u^{3,1}, u^{3,1}]$ from Fig. 2 contain the same unwrapped action sequence, $h_2 = [u^3, u^3]$. This redundancy is unwanted and should thus be avoided. The way this is performed does not influence the analysis and remains a detail of implementation.

Proposition 2. OKP expands only nodes that satisfy $v^* - l(h_d) \le \gamma^d / (1 - \gamma)$, with $d$ the cumulative depth in the search tree. Therefore, OKP is $\gamma^{d^*} / (1 - \gamma)$-optimal, with $d^*$ the largest cumulative depth, i.e. that of the deepest expanded node.

Proof. We follow a similar line to the proof of the OPD a posteriori guarantee from Proposition 1, detailed in [3]. At any iteration, there exists a node $h^*$ in the tree that is the initial subsequence of an optimal sequence. Hence, $b(h^*) \ge v^*$ by definition (5). While we expand a possibly different node $h_d$, because we maximize b-values we have $b(h_d) \ge b(h^*) \ge v^*$. Using (5), this is equivalent to $v^* - l(h_d) \le \gamma^d / (1 - \gamma)$. Denote now the node returned by the algorithm by $h_{ret}$. Since the lower bounds only increase with further expansions and $h_d$ is already contained in the search tree, we have $l(h_{ret}) \ge l(h_d)$. So finally, $l(h_{ret}) \ge v^* - \gamma^d / (1 - \gamma)$, from which, by definition of the lower bound from (6), $v(h_{ret}) \ge v^* - \gamma^d / (1 - \gamma)$, or $v^* - v(h_{ret}) \le \gamma^d / (1 - \gamma)$. We take the most favorable $d$, which means the depth $d^*$ of the deepest expanded node.

Besides the a posteriori guarantee, another performance indicator is the a priori bound that relates the available computational budget $n$ to the expansion depth reached, and so to the near-optimality of the algorithm. In the case of OKP, obtaining such a bound is more difficult, as the increase of the tree depth is related both to the available computational budget and to the factor K, where the latter influences the depth increase arbitrarily (between 1 and K with each expansion). A detailed discussion can be found in [20]. The performance of OKP can thus be evaluated only a posteriori.
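To give a feeling for the a posteriori bound, the following sketch evaluates $\gamma^d / (1 - \gamma)$ for a given depth and searches for the smallest depth that brings the bound below a tolerance. The function names and the numeric values are illustrative assumptions, not quantities taken from the experiments.

```python
def a_posteriori_bound(gamma: float, depth: int) -> float:
    """Near-optimality bound gamma^d / (1 - gamma) for deepest expanded depth d."""
    return gamma ** depth / (1.0 - gamma)


def depth_needed(gamma: float, eps: float) -> int:
    """Smallest depth whose bound does not exceed eps (needs 0 <= gamma < 1, eps > 0)."""
    d = 0
    while a_posteriori_bound(gamma, d) > eps:
        d += 1
    return d


# Illustrative values only.
bound_at_3 = a_posteriori_bound(0.5, 3)
min_depth = depth_needed(0.5, 0.25)
```

The bound shrinks geometrically with the depth, which is why the depth of the deepest expanded node is used as the performance indicator in the analysis.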
The obtained bound cannot be directly compared to the one of OPD, as given the same control problem the two algorithms expand different search trees. As OKP expands more nodes with each step, $d'_{OKP} < d_{OPD}$, from where one cannot conclude a direct relation between $d_{OKP}$ and $d_{OPD}$, the real action sequence lengths. In certain cases OKP outperforms OPD, and these are the types of problems we are interested in. In other cases OPD will work better.

Consider two special cases, presented in Fig. 3. In the first case (left graph), there exists a single path with rewards always equal to one, and all the other branches have zero reward. In the second case (right graph), uniform zero rewards are obtained except for one path, where each K-th action will have a reward of one. In both cases, we evaluate the tree depth the algorithms reach (as an indicator of the near-optimality), considering the computational budget $n$ as the number of nodes added to the search tree. Note that this is a fairer measure of computation, as the most resource-consuming operation is the simulation of transitions (i.e. applying an action, which corresponds to the addition of a node). We denote the depth of the deepest expanded node by $d_{OPD}$ and $d_{OKP}$, respectively, for the two algorithms.

In the first case of a path of rewards equal to one, both algorithms expand only on this path. OPD thus expands until depth $d_{OPD} = (n - 1) / M$. For OKP, each expansion adds $M K$ nodes with a depth increase of at most K. Now, since each expanded node on the optimal path will have the same b-value, $b(h_d) = 1 / (1 - \gamma)$, the algorithm will have to apply a tie-breaking rule. Commonly, when multiple candidates appear for expansion, the first created node is selected for further expansion, which in our case will be the node with the fewest repetitions. Thus, OKP keeps on expanding on the optimal path node by node, in increasing order of depth, and will expand until depth $d_{OKP} = (n - 1) / (M K)$. Therefore, in this case $d_{OKP} < d_{OPD}$.

In the case of the second graph from Fig. 3, OPD will perform uniform expansion until depth K, after which it continues with uniform expansion only from the node that has the reward of one, and so on. In this manner, OPD will expand subtrees having

$$1 + M + M^2 + \dots + M^{K-1} + 1 = \frac{M^K - 1}{M - 1} + 1$$

nodes. For reaching a depth $d_{OPD}$ that is a multiple of K, OPD will therefore require a budget of

$$n = \frac{d_{OPD}}{K} \left( \frac{M^K - 1}{M - 1} + 1 \right).$$

On the other hand, OKP with repetition K will always find the rewards of one with one expansion and thus explore only on the optimal path. Thus

$$d_{OKP} = \lfloor n / M \rfloor \ge n / M - 1.$$

But

$$d_{OPD} = \frac{n}{(M^K + M - 2) / (M - 1)} \, K,$$

from where

$$d_{OPD} \le \frac{n}{(M^K + M - 2) / M} \, K \le \frac{n}{M^{K-1}} \, K.$$

Therefore, for large enough K we have $d_{OKP} > d_{OPD}$.
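For the first special case above (a single path with all rewards equal to one), the two depth expressions can be compared numerically. The helper below is only a sketch: its name is hypothetical, it rounds down when the budget does not divide evenly, and the budget and parameter values are picked for illustration.

```python
def depth_single_rewarded_path(n: int, M: int, K: int = 1) -> int:
    """Depth reached in the first special case with a budget of n nodes.

    Uses d = (n - 1) / (M * K) from the text (K=1 gives the OPD expression),
    rounded down when the budget does not divide evenly.
    """
    return (n - 1) // (M * K)


# Illustrative budget and parameters.
d_opd = depth_single_rewarded_path(1921, M=3)
d_okp = depth_single_rewarded_path(1921, M=3, K=16)
```

With the first-created tie-breaking rule assumed in the text, the depth reached by OKP in this case is smaller than that of OPD by a factor of K.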

Fig. 3. Search tree with: all the rewards zero except for the right-most path (left); all the rewards zero except for every K-th transition on the right-most path, with K=2 (right).

Besides these special cases, the experiments from the sequel show how OPD compares with OKP in further situations.

4. EXPERIMENTAL RESULTS

The algorithms are evaluated considering two control problems: the inverted pendulum and HIV infection treatment. Detailed test results and discussions are provided for the inverted pendulum, while the treatment control problem is considered as it is a more complex, high-dimensional example.

Inverted pendulum swing-up

In the inverted pendulum problem, optimal control aims to swing up and stabilize an underactuated pendulum. The control power is lower than what would be sufficient to swing up the pendulum using a single rotation. Thus, several swings are required to stabilize the pendulum in the pointing-up state. As the swings can be obtained by sequences of constant actions, and the multiple swings require a long planning horizon, this control problem matches the problem class where OKP is expected to provide good performance.

Fig. 4. Inverted pendulum

The pendulum is modelled using a state vector that consists of the angle of the pendulum and the angular velocity: $x = [x_1, x_2]^T$ with $x_1 = \alpha \in [-\pi, \pi)$ rad and $x_2 = \dot{\alpha} \in [-15\pi, 15\pi]$ rad/s. The angles cover the entire circle with 0 for the pointing-up state, whereas the angular velocity is bounded by saturation to the given interval. The actions are taken in the discrete space $U = \{-u_{max}, 0, u_{max}\}$ V, representing the voltage applied to the motors. Except for the last set of experiments, we take $u_{max} = 2$ V. The reward function is considered in its unnormalized form as $\rho(x, u) = -x^T Q x - u^T R u$, with a state weighting matrix $Q$ and $R = 0.1$. The normalization of the reward function to the interval [0, 1] is performed based on the state bounds. A fixed discount factor $\gamma$ is used for calculating the cumulative reward.

We specify computational budgets using $n$, the number of nodes added to the search tree (corresponding to the number of simulated transitions). Each OPD expansion adds M nodes to the tree, whereas each OKP expansion adds K M nodes. Another remark is that the duplicate elimination in the case of OKP leads to smaller search trees than those of OPD. For this reason, an equal tree-size variant of the algorithm will be considered as well, which in place of the eliminated duplicates adds further nodes to the tree until reaching the same tree size $n$ as with OPD.

We consider two types of experiments. In the first, we calculate offline regrets that show the near-optimality of the algorithm. We take a set of initial states $\{-\pi, -5\pi/6, \dots, \pi\}$ rad $\times$ $\{-15\pi, -14\pi, \dots, 15\pi\}$ rad/s, and for each state calculate a single control action. Finally, we evaluate the average of the real regret, which measures the near-optimality of the algorithm with respect to the optimum $v^*$, a value available from another algorithm at much higher computational cost; the formula is $r_n = v^* - l(h')$.

The second set of experiments calculates the online return, that is the cumulative reward obtained while applying a sequence of actions in closed loop. In these experiments, the pendulum is started from the initial state $x_0 = [-\pi, 0]^T$ (pendulum pointing down, with zero angular velocity) and the algorithms calculate a near-optimal control action and apply it to the system, reaching a new state, after which the searches are repeated. We allow the experiments to run for T = 4 s, which is usually enough for swinging up and stabilizing the pendulum.
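The closed-loop (receding horizon) procedure used for the online return can be sketched as follows. Everything here is an assumption made for illustration: the function name `run_closed_loop`, the planner callback `plan` (any planner that returns an action sequence for a state), the choice of an undiscounted sum by default, and the toy system used to exercise the loop.

```python
from typing import Callable, List, Sequence, Tuple


def run_closed_loop(x0, plan: Callable[[int], Sequence[int]],
                    f: Callable[[int, int], int],
                    rho: Callable[[int, int], float],
                    n_steps: int, gamma: float = 1.0) -> Tuple[float, int, List[int]]:
    """Apply only the first planned action at each step, then replan.

    Returns the accumulated (optionally discounted) reward, the final state,
    and the list of actions actually applied.
    """
    x, ret, discount = x0, 0.0, 1.0
    applied = []
    for _ in range(n_steps):
        u = plan(x)[0]            # first action of the planned sequence
        ret += discount * rho(x, u)
        x = f(x, u)               # the system moves to the new state
        discount *= gamma
        applied.append(u)
    return ret, x, applied


# Hypothetical toy system and a trivial stand-in planner.
def f(x, u):
    return x + u


def rho(x, u):
    return 1.0 if u == 1 else 0.0


online_return, x_end, actions_applied = run_closed_loop(0, lambda x: (1, 1), f, rho, n_steps=4)
```

In an actual experiment the stand-in planner would be replaced by a tree search started from the current state at every step, which is why the number of search tree constructions equals the number of control steps.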

The sampling time is taken as $T_S = 0.025$ s, which results in 160 search tree constructions.

We first analyze the effect of varying the value of K for a given computational budget. After several preliminary tests, we observed that OPD reaches a near-optimal solution (i.e. is able to swing up and stabilize the pendulum) using a budget of n = 1500 nodes added to the tree per simulation. Taking the algorithm parameters in the set $K \in \{1, 2, 4, 5, 8, 10, 16, 20, 32\}$, we select n so as to have all nodes completely expanded, i.e. in the case of OKP to be a multiple of M K for any K. Therefore, we take n = 1920. Fig. 5 shows the results from the offline tests, while Fig. 6 compares online performance.

Fig. 5. Average regret for variable K and n = 1920

From Fig. 5, the best (smallest) regret is obtained for K=16, and one may observe that smaller values of K do not yet reach the performance of OPD, while higher values of K already provide weaker performance. This confirms that the choice of K is subject to fine-tuning and that higher values than the chosen one weaken the performance. All these conclusions are confirmed by the obtained cumulative rewards in the online experiments as well, presented in Fig. 6.

Fig. 6. Return for variable K and n = 1920

Now, taking in the sequel K=16 for OKP, we perform offline and online tests for a set of computational budgets. Fig. 7 presents the offline test results.

Fig. 7. Regret obtained with K=16

The first conclusion is that OKP offers better performance than OPD, having lower regret than OPD for both of its variants. Comparing the two variants of OKP, there is no evidence that the equal tree-size algorithm, which expands more nodes than the other one, would result in the same or better guarantees. Nevertheless, there is no decreasing trend of the regrets with respect to the computational budget for any of the algorithms in general.

Further simulations analyze the tree depth increase in the case of offline tests. The tree depth is determined by the depth of the deepest expanded node, which in the case of OPD matches the depth $d$ introduced in Section II, while for OKP it is the cumulative depth $d$, also used in Section III.

Fig. 8. Tree depth with K=16

Fig. 8 confirms that higher budgets allow for the construction of deeper search trees and that the increase of the tree depth is monotonic. Comparing the two OKP variants, as the equal tree-size variant expands more nodes than the other one, a small increase in depth is also observed. Nevertheless, a greater difference is observed between the depths reached by OPD and the OKP variants. OPD constructs much deeper search trees, i.e. $d_{OPD} > d_{OKP}$, unlike the expectations from the analysis. This confirms our discussions from Section III regarding the depths reached by the algorithms and thus invalidates the initial intuition of constructing deeper trees with OKP, at least in the case of the current control problem. Nevertheless, the actual performance obtained for the inverted pendulum is better for OKP, as shown in Fig. 7. Thus, for suitable control problems, the real performance of OKP might be better than the performance of OPD, as shown in the sequel with the online experiments as well.

Fig. 9. Returns obtained for K=16.

Fig. 9 shows the online experimental results, i.e. the returns obtained by the algorithms when performing the pendulum swing-up, using different values for the computational budget. The near-optimal solution is around a return of 30, which all the algorithms obtain from a certain budget n onward. Two important conclusions can be drawn. First, the OKP algorithm obtains an overall higher return than OPD, i.e. OKP has a better performance than OPD. Second, OKP reaches the near-optimal solutions sooner than OPD. This can be seen for small values of n, where OPD is sub-optimal but OKP already provides near-optimal solutions.

Finally, keeping K=16 for OKP, we compare the online performance of the algorithms when varying the control input.

Fig. 10. Returns obtained for K=16 and action spaces U = {-1.5, 0, 1.5} V (left) and U = {-3, 0, 3} V (right).

Figures 9 and 10 present the results for three different values of $u_{max}$ from the action space U: Fig. 9 considers $u_{max} = 2$ V, while Fig. 10 shows results for $u_{max} = 1.5$ V and $u_{max} = 3$ V. Looking at these results, OKP keeps its performance advantage over OPD when different control inputs are taken. However, its improvement compared to OPD varies based on which MDP is considered, a fact that confirms that the value of K is subject to fine-tuning according to the given control problem.

HIV infection treatment

Prevalent HIV treatment strategies involve two types of drugs. These drugs have negative side effects in the long term, which motivates research into optimal strategies for their use. In the structured treatment interruptions strategy, the patient is cycled on and off drugs, see e.g. [21]. The HIV infection dynamics are described by a six-dimensional nonlinear model with two binary inputs corresponding to the application of the two drugs, so there are 4 discrete actions; the sampling time is 5 days. The objective is to drive the system from an unhealthy equilibrium, where the infection has taken hold, to the basin of attraction of a healthy equilibrium where the patient controls the infection without the need for drugs. For the model, parameters, and reward function, see [21].

In our experiments, we consider using the two drugs in the combinations {{0, 0}, {0, 0.03}, {0.7, 0}, {0.7, 0.03}}. These combinations refer to the quantity of the two drugs considered during each time sample of 5 days. After several tests, we found that OPD is able to find near-optimal solutions with trajectory lengths of 200 days, i.e. it can find combinations that aid healing when simulating treatments that would last 200 days. Taking this trajectory length, we consider OKP and vary the number of repetitions K to verify our previous results. Fig. 11 shows the return obtained when applying the algorithms online, in simulations.

Fig. 11. HIV infection treatment: evaluating OKP for variable K

Looking at Fig. 11, one may observe that increasing the value of K does not necessarily result in better performance. Thus, this problem also confirms that the value of K has to be tuned. However, one can see again that for certain values, OKP is able to provide better near-optimal solutions than OPD.

5. CONCLUSIONS

In this paper a novel algorithm called optimistic planning with K identical actions (OKP) was introduced. It addresses the complexity concerns of optimistic planning for deterministic systems (OPD) for a subclass of control problems, extending OPD by evaluating longer ranges of constant action sequences in addition to one-step actions whenever an action sequence is further refined. In the class of problems where the control action should preferably be changed rarely, such as switched systems (e.g. barriers or mechanical switches) or networked control systems, OKP is expected to provide closer solutions to the optimum than OPD is able to. Our analysis showed that OKP has similar performance guarantees compared to OPD and that it can outperform the latter in certain types of problems, although a general relation between the performance of the two algorithms could not be drawn. Experiments confirmed that, for an inverted pendulum and an HIV infection problem, OKP outperforms OPD in terms of near-optimality when parameter K is chosen properly.

An interesting direction for future work is a more complete analytical characterization of the class of problems where OKP is expected to work well, and an a priori guarantee relating budget to near-optimality, like the one available for OPD.

ACKNOWLEDGEMENTS

This paper is supported by the Sectoral Operational Programme Human Resources Development (SOP HRD), POSDRU/159/1.5/S/137516, financed by the European Social Fund and by the Romanian Government; and by a grant from the Romanian National Authority for Scientific Research, CNCS-UEFISCDI, project number PNII-RU-TE

REFERENCES

1. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press.
2. D. P. Bertsekas, Dynamic Programming and Optimal Control, 4th ed. Athena Scientific, 2012.
3. J.-F. Hren and R. Munos, Optimistic planning of deterministic systems, in Proceedings 8th European Workshop on Reinforcement Learning (EWRL-08), Villeneuve d'Ascq, France, 30 June - 3 July 2008.
4. R. Vinter, Optimal Control. Springer Science & Business Media.
5. H. van Ekeren, R. Negenborn, P. van Overloop, and B. De Schutter, Time-instant optimization for hybrid model predictive control of the Rhine-Meuse delta, Journal of Hydroinformatics, vol. 15, no. 2.
6. J. C. Geromel and P. Colaneri, Stability and stabilization of discrete time switched systems, International Journal of Control, vol. 79, no. 7.
7. R. Munos, The optimistic principle applied to games, optimization and planning: Towards foundations of Monte-Carlo tree search, Foundations and Trends in Machine Learning, vol. 7, no. 1.
8. S. Bubeck and R. Munos, Open loop optimistic planning, in Proceedings 23rd Annual Conference on Learning Theory, Haifa, Israel, June 2010.
9. L. Busoniu, R. Munos et al., Optimistic planning for Markov decision processes, in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, AISTATS-12, vol. 22, 2012.
10. L. Busoniu, A. Daniels, R. Munos, and R. Babuska, Optimistic planning for continuous action deterministic systems, in 2013 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-13), Singapore, April 2013.
11. A. Weinstein and M. L. Littman, Bandit-based planning and learning in continuous-action Markov decision processes, in Proceedings of the 22nd International Conference on Automated Planning and Scheduling (ICAPS).
12. K. Mathe, L. Busoniu, R. Munos, and B. De Schutter, Optimistic planning with a limited number of action switches for near-optimal nonlinear control, in Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, Dec 2014.
13. M. Quincampoix and N. Seube, Stabilization of uncertain control systems through piecewise constant feedback, Journal of Mathematical Analysis and Applications, vol. 218, no. 1.
14. L. Magni and R. Scattolini, Model predictive control of continuous-time nonlinear systems with piecewise constant control, Automatic Control, IEEE Transactions on, vol. 49, no. 6.
15. L. Magni and R. Scattolini, Tracking of non-square nonlinear continuous time systems with piecewise constant model predictive control, Journal of Process Control, vol. 17, no. 8.
16. Y.-Y. Cao, L. Hu, and P. Frank, Model predictive control via piecewise constant output feedback for multirate sampled-data systems, in Decision and Control, Proceedings of the 39th IEEE Conference on, vol. 1. IEEE, 2000.
17. R. Findeisen and F. Allgower, Computational delay in nonlinear model predictive control, in Proceedings International Symposium on the Advanced Control of Chemical Processes, 2004.
18. X. Yang and L. T. Biegler, Advanced-multi-step nonlinear model predictive control, Journal of Process Control, vol. 23, no. 8.
19. C. Liu, W.-H. Chen, and J. Andrews, Piecewise constant model predictive control for autonomous helicopters, Robotics and Autonomous Systems, vol. 59, no. 7.
20. K. Mathe, L. Busoniu, and L. Miclea, Optimistic planning with long sequences of identical actions for near-optimal nonlinear control, in International Conference on Automation, Quality and Testing, Robotics. IEEE, 2014.
21. B. Adams, H. Banks, H.-D. Kwon, and H. Tran, Dynamic multidrug therapies for HIV: Optimal and STI control approaches, Mathematical Biosciences and Engineering, vol. 1, no. 2.

Koppany Mathe
Department of Automation, Technical University of Cluj-Napoca, Memorandumului 28, Cluj-Napoca, Romania
Koppany.Mathe@aut.utcluj.ro
