Fuzzy Model-Based Reinforcement Learning


Martin Appl (1), Wilfried Brauer (2)

(1) Siemens AG, Corporate Technology, Information and Communications, D-8173 Munich, Germany, mail@martinappl.de
(2) Technical University of Munich, Department of Computer Science, D-829 Munich, Germany, brauer@informatik.tu-muenchen.de

ABSTRACT: Model-based reinforcement learning methods are known to be highly efficient with respect to the number of trials required for learning optimal policies. In this article, a novel fuzzy model-based reinforcement learning approach, fuzzy prioritized sweeping (F-PS), is presented. The approach is capable of learning strategies for Markov decision problems with continuous state and action spaces. The output of the algorithm is a Takagi-Sugeno fuzzy system with linear terms in the consequents of the rules. From the Q-function approximated by this fuzzy system an optimal control strategy can be easily derived. The proposed method is applied to the problem of selecting optimal framework signal plans in urban traffic networks. It is shown that the method outperforms existing model-based approaches.

KEYWORDS: reinforcement learning, model-based learning, fuzzy prioritized sweeping, Takagi-Sugeno fuzzy systems, framework signal plans

INTRODUCTION

Reinforcement learning means learning from experiences (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). A reinforcement learning agent perceives certain characteristics of its environment, influences the environment by performing actions, and finally receives rewards according to the appropriateness of the selected actions. One can distinguish between indirect and direct reinforcement learning methods. Indirect methods, such as prioritized sweeping (Moore and Atkeson, 1993), build an internal model of the environment and calculate the optimal policy based on this model, whereas direct methods, such as Q-learning (Watkins, 1989), do not use an explicit model but learn directly from experiences. Indirect reinforcement learning methods are known to learn much faster than direct methods in many settings, since they can reuse information stored in their internal model.

Learning models of discrete environments is much easier than learning models of continuous ones. This may be the reason why most publications on model-based reinforcement learning deal with discrete Markov decision problems. Discrete methods can, of course, also be applied to continuous problems by discretizing the state and action spaces of these problems. The main challenge of this approach, however, is to define a partition of reasonable granularity, since fine partitions lead to a high number of states and thus to complex problems, whereas approximations based on coarse crisp partitions can be highly imprecise.

Model-based learning in continuous state spaces was previously discussed by Davies (1997), who suggested defining a coarse grid on the state space and approximating the continuous value function by interpolation based on this grid. This approximation approach is comparable to a Takagi-Sugeno fuzzy system with triangular membership functions and constant terms in the consequents of the rules. Davies, however, used a crisp partition for the training of the transition probabilities and the corresponding rewards, which seems to be inconsistent with the idea of interpolating. Besides, he did not consider continuous actions.

In this article, a fuzzy model-based reinforcement learning approach, fuzzy prioritized sweeping (F-PS), is considered.
The approach is capable of learning strategies for problems with continuous state and action spaces. The output of the F-PS approach is a Takagi-Sugeno fuzzy system with linear rules (Takagi and Sugeno, 1985). With such fuzzy systems, continuous value functions can be approximated much more precisely than with approximation architectures based on crisp partitions. Alternatively, the number of partitioning subsets can be reduced. The proposed method is applied to the problem of selecting framework signal plans in dependence of traffic conditions. Several approaches applying reinforcement learning to problems from traffic signal control can be found in the literature (e.g. Thorpe, 1997; Bingham, 1998; Appl and Palm, 1999). To the authors, however, no publication on the selection of framework signal plans by means of reinforcement learning methods is known.

In the following section, the basic Markov decision problem on which the further considerations are based is introduced. Afterwards, the fuzzy model-based reinforcement learning approach is presented. Finally, the effectiveness of the proposed algorithm is shown on the task of selecting framework signal plans.

BASIC MODEL

In the following it is assumed that the reinforcement learning agent gets inputs from a continuous state space X of dimension N_X and may perform actions taken from a continuous action space A of dimension N_A. The sets of dimensions of the state space and the action space will be denoted by D_X := {1, ..., N_X} and D_A := {1, ..., N_A}, respectively. Let, for each state x ∈ X and each action a ∈ A, p(y; x, a) be a probability density function giving the distribution of the successor state y if action a is executed in state x. Furthermore, let g(x, a, y) ∈ R be the unknown reward the agent gets for executing action a in state x if the action causes a transition to state y. The agent is supposed to select actions at discrete points in time. The goal of the learning task then is to find a stationary policy µ : X → A, i.e. a mapping from states to actions, such that the expected sum of discounted future rewards

    J^µ(x) := lim_{N→∞} E{ Σ_{κ=0}^{N} α^κ g(x_κ, µ(x_κ), x_{κ+1}) | x_0 = x },   α ∈ [0, 1),    (1)

is maximized for each x ∈ X, where x_{κ+1} is determined from x_κ using p(x_{κ+1}; x_κ, µ(x_κ)). Let

    Q^µ(x, a) := ∫ p(y; x, a) [ g(x, a, y) + α J^µ(y) ] dy    (2)

be the sum of discounted future rewards the agent may expect if it executes action a in state x and behaves according to the policy µ afterwards. Then, the optimal Q-values Q^{µ*}(x, a) are given by the fixed-point solution of the Bellman equation

    Q^{µ*}(x, a) = ∫ p(y; x, a) [ g(x, a, y) + α max_{b∈A} Q^{µ*}(y, b) ] dy,    (3)

and the optimal policy µ* is to execute in each state x the action a that maximizes these Q-values:

    µ*(x) := argmax_{a∈A} Q^{µ*}(x, a).    (4)

The F-PS approach described in the following approximates the continuous Q-function Q^{µ*} by a Takagi-Sugeno fuzzy system. Thereto, it is assumed that a fuzzy partition {µ^X_i}_{i∈I} of the state space is defined, where the subscripts of the N_{µX} membership functions are given by I = {1, ..., N_{µX}} and the labels and centers of the partitioning subsets are given by {X_i}_{i∈I} and {x̃_i}_{i∈I}, respectively. Likewise, it is assumed that the action space is partitioned by {µ^A_u}_{u∈U}, where U = {1, ..., N_{µA}} is the set of subscripts, {A_u}_{u∈U} gives the labels and {ã_u}_{u∈U} the centers of the N_{µA} subsets of the partition.

FUZZY MODEL-BASED LEARNING

The basic idea of the F-PS approach presented in the following is to learn an approximation of the unknown continuous Q-function Q^{µ*}, from which the optimal strategy can be easily derived (cf. eqn. 4).
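For concreteness, the following is a minimal sketch (Python; hypothetical helper, not part of the original paper) of one common way to realize such a fuzzy partition: normalized triangular membership functions over one dimension, with memberships of higher-dimensional states or actions obtained as products of the per-dimension memberships.

```python
# Sketch of a one-dimensional fuzzy partition with triangular membership
# functions, as assumed for the state and action spaces above.  The centers
# are hypothetical; any strictly increasing grid can be used.
import numpy as np

def triangular_partition(centers):
    """Return a function mapping a scalar x to the membership vector (mu_i(x))_i."""
    centers = np.asarray(centers, dtype=float)

    def memberships(x):
        mu = np.zeros(len(centers))
        if x <= centers[0]:
            mu[0] = 1.0
        elif x >= centers[-1]:
            mu[-1] = 1.0
        else:
            j = np.searchsorted(centers, x) - 1          # centers[j] <= x < centers[j+1]
            w = (x - centers[j]) / (centers[j + 1] - centers[j])
            mu[j], mu[j + 1] = 1.0 - w, w                # neighboring memberships sum to one
        return mu

    return memberships

mu_X = triangular_partition([0.0, 0.25, 0.5, 0.75, 1.0])  # five fuzzy sets on [0, 1]
print(mu_X(0.6))                                          # [0.  0.  0.6 0.4 0. ]
```

With a partition of this kind the memberships sum to one, so the Takagi-Sugeno approximation introduced next reduces to interpolation between the rule consequents, which is the connection to the grid-based scheme of Davies (1997) mentioned in the introduction.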
The Q-function will be approximated by the Takagi-Sugeno fuzzy system (Takagi and Sugeno, 1985; Sugeno, 1985)

    if x is X_i and a is A_u then Q^µ(x, a) = Q̂_iu + Σ_{l∈D_X} Q̂^{x_l}_iu (x_l − x̃_{i,l}) + Σ_{l∈D_A} Q̂^{a_l}_iu (a_l − ã_{u,l}),   i ∈ I, u ∈ U,    (5)
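Evaluating this rule base with the usual Takagi-Sugeno weighted-average inference, and selecting actions greedily as in (4), might look as follows. This is an illustrative sketch with hypothetical names (the paper does not prescribe an implementation); the continuous maximization in (4) is replaced here by a search over a finite set of candidate actions.

```python
# Sketch of evaluating the Takagi-Sugeno Q-function (5) and of greedy action
# selection (4).  All parameter arrays and membership functions are assumed given;
# x and a are NumPy vectors.
import numpy as np

def q_value(x, a, mu_X, mu_A, x_cent, a_cent, Q0, Qx, Qa):
    """
    mu_X, mu_A : functions returning the membership vectors of state x / action a
    x_cent     : (N_muX, N_X) centers x~_i;    a_cent: (N_muA, N_A) centers a~_u
    Q0         : (N_muX, N_muA) average Q-values Q^_iu
    Qx         : (N_muX, N_muA, N_X) derivatives Q^{x_l}_iu
    Qa         : (N_muX, N_muA, N_A) derivatives Q^{a_l}_iu
    """
    wx, wa = mu_X(x), mu_A(a)
    # consequent of rule (i, u):  Q^_iu + Qx_iu . (x - x~_i) + Qa_iu . (a - a~_u)
    local = (Q0
             + np.einsum('iul,il->iu', Qx, x - x_cent)
             + np.einsum('iul,ul->iu', Qa, a - a_cent))
    w = np.outer(wx, wa)                       # rule activations mu^X_i(x) * mu^A_u(a)
    return float((w * local).sum() / w.sum())  # weighted Takagi-Sugeno inference

def greedy_action(x, candidate_actions, model):
    """Eq. (4): return the candidate action with the largest approximated Q-value."""
    return max(candidate_actions, key=lambda a: q_value(x, a, **model))
```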

where Q̂_iu is an estimate of the average Q-value in (X_i, A_u), and Q̂^{x_l}_iu and Q̂^{a_l}_iu are estimates of the local partial derivatives ∂Q^µ/∂x_l (x̃_i, ã_u) and ∂Q^µ/∂a_l (x̃_i, ã_u), respectively. The estimation of the average Q-values and the average partial derivatives will be considered in the following subsections.

ESTIMATION OF AVERAGE Q-VALUES

Let N_{iu,k} be counters giving the number of executions of fuzzy action A_u in fuzzy state X_i until iteration k (i ∈ I, u ∈ U). Likewise, let M_{iuj,k} be counters giving the number of times that the execution of action A_u in state X_i caused a transition to X_j (i, j ∈ I, u ∈ U). On the observation of a transition (x_k, a_k, x_{k+1}), x_k ∈ X, x_{k+1} ∈ X, a_k ∈ A, with reward g_k ∈ R, these counters are increased according to the degrees of membership of the transition in the corresponding centers:

    N_{iu,k+1} := N_{iu,k} + µ^X_i(x_k) µ^A_u(a_k),   i ∈ I, u ∈ U,    (6)
    M_{iuj,k+1} := M_{iuj,k} + µ^X_i(x_k) µ^A_u(a_k) µ^X_j(x_{k+1}),   i ∈ I, u ∈ U, j ∈ I.    (7)

Based on these counters one can estimate the probability

    p_ij(u) := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx    (8)

that the execution of action A_u in state X_i causes a transition to state X_j:

    p̂_{ij,k+1}(u) := M_{iuj,k+1} / N_{iu,k+1}.    (9)

Let g_iuj be the average reward the agent may expect if it executes action A_u in state X_i and the action causes a transition to state X_j:

    g_iuj := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) g(x, a, y) dy da dx / ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx.    (10)

Then, an estimate ĝ_iuj of these average rewards can be gained by performing the update

    ĝ_{iuj,k+1} := ĝ_{iuj,k} + ( µ^X_i(x_k) µ^A_u(a_k) µ^X_j(x_{k+1}) / M_{iuj,k+1} ) [ g_k − ĝ_{iuj,k} ],   i ∈ I, u ∈ U, j ∈ I,    (11)

on the observation of transitions (x_k, a_k, x_{k+1}), x_k ∈ X, x_{k+1} ∈ X, a_k ∈ A, with rewards g_k ∈ R. Based on the discrete model (p̂_{ij,k+1}(u), ĝ_{iuj,k+1}) one can now calculate average Q-values. It can be shown that the solution of the fixed-point equation

    Q̂_{iu,k+1} = Σ_{j∈I} p̂_{ij,k+1}(u) [ ĝ_{iuj,k+1} + α max_{v∈U} Q̂_{jv,k+1} ]    (12)

gives estimates of the average Q-values

    Q_iu := ∫∫ µ^X_i(x) µ^A_u(a) Q^µ(x, a) da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx.    (13)

These estimates can be used in the representation (5) of the Q-function. The system (12) can be advantageously solved by discrete prioritized sweeping (Moore and Atkeson, 1993).
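The update rules (6), (7), (9) and (11) amount to maintaining fuzzy visit counts and a running reward average for every triple (X_i, A_u, X_j). A minimal sketch (Python; hypothetical class, assuming membership functions mu_X and mu_A that return NumPy membership vectors) is given below; the fixed-point system (12) would then be solved on top of this model, e.g. by standard discrete prioritized sweeping.

```python
# Sketch of the fuzzy model updates (6), (7), (9), (11) for one observed
# transition (x_k, a_k, x_{k+1}) with reward g_k.
import numpy as np

class FuzzyModel:
    def __init__(self, mu_X, mu_A, n_state_sets, n_action_sets):
        self.mu_X, self.mu_A = mu_X, mu_A
        self.N = np.zeros((n_state_sets, n_action_sets))                # N_iu
        self.M = np.zeros((n_state_sets, n_action_sets, n_state_sets))  # M_iuj
        self.g = np.zeros_like(self.M)                                  # g^_iuj

    def observe(self, x, a, x_next, reward):
        wx, wa, wy = self.mu_X(x), self.mu_A(a), self.mu_X(x_next)
        w = wx[:, None, None] * wa[None, :, None] * wy[None, None, :]
        self.N += np.outer(wx, wa)                                      # eq. (6)
        self.M += w                                                     # eq. (7)
        visited = self.M > 0
        # eq. (11): move g^_iuj toward the observed reward with stepsize w / M
        self.g[visited] += (w[visited] / self.M[visited]) * (reward - self.g[visited])

    def transition_probabilities(self):
        """Eq. (9): p^_ij(u) = M_iuj / N_iu (zero where nothing has been observed)."""
        N = self.N[:, :, None]
        return np.divide(self.M, N, out=np.zeros_like(self.M), where=N > 0)
```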

ESTIMATION OF AVERAGE PARTIAL DERIVATIVES

The partial derivatives Q̂^{x_l}_iu and Q̂^{a_l}_iu of the Q-function can be derived from average values and partial derivatives of the reward function and the transition probabilities. It can be shown that the following is satisfied for the partial derivatives with respect to the dimensions of the state space:

    Q̂^{x_l}_iu = ∂Q^µ/∂x_l (x̃_i, ã_u)
               = ∂/∂x_l ∫ p(y; x, ã_u) [ g(x, ã_u, y) + α max_{b∈A} Q^µ(y, b) ] dy |_{x = x̃_i}    (14)
               ≈ Σ_{j∈I} [ p^{x_l}_ij(u) ( g_iuj + α max_{v∈U} Q_jv ) + p_ij(u) g^{x_l}_iuj ],    (15)

where the average rewards g_iuj and transition probabilities p_ij(u) were defined in the preceding section and the average derivatives p^{x_l}_ij(u) and g^{x_l}_iuj are given by

    p^{x_l}_ij(u) := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂x_l) p(y; x, a) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (16)

    g^{x_l}_iuj := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂x_l) g(x, a, y) p(y; x, a) dy da dx / ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx.    (17)

Likewise, the partial derivatives with respect to the dimensions of the action space can be approximated as follows:

    Q̂^{a_l}_iu = ∂Q^µ/∂a_l (x̃_i, ã_u) ≈ Σ_{j∈I} [ p^{a_l}_ij(u) ( g_iuj + α max_{v∈U} Q_jv ) + p_ij(u) g^{a_l}_iuj ],    (18)

where the abbreviations

    p^{a_l}_ij(u) := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂a_l) p(y; x, a) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (19)

    g^{a_l}_iuj := ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) (∂/∂a_l) g(x, a, y) p(y; x, a) dy da dx / ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) p(y; x, a) dy da dx    (20)

were introduced. In the following subsections, it will be shown how the average partial derivatives of the reward function and the conditional probability density function can be estimated from observed transitions. Then, the partial derivatives of the Q-function can be estimated using the approximations (15) and (18).
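Once the averaged model quantities and their derivatives are available (their estimation is the subject of the following subsections), assembling the derivative estimates according to (15) and (18) is a sum over the successor sets X_j. A sketch under these assumptions (Python, hypothetical names) is:

```python
# Sketch of the approximations (15) and (18): partial derivatives of the
# Q-function assembled from the estimated average model quantities.
import numpy as np

def q_derivatives(p, dp_x, dp_a, g, dg_x, dg_a, Q_avg, alpha):
    """
    p    : (I, U, J)       transition probabilities p_ij(u)
    dp_x : (I, U, J, N_X)  derivatives p^{x_l}_ij(u);   dp_a: (I, U, J, N_A)
    g    : (I, U, J)       average rewards g_iuj
    dg_x : (I, U, J, N_X)  derivatives g^{x_l}_iuj;     dg_a: (I, U, J, N_A)
    Q_avg: (J, U)          average Q-values Q_jv from the fixed point (12)
    """
    # g_iuj + alpha * max_v Q_jv, broadcast over (i, u, j)
    target = g + alpha * Q_avg.max(axis=1)[None, None, :]
    Qx = np.einsum('iujl,iuj->iul', dp_x, target) + np.einsum('iuj,iujl->iul', p, dg_x)
    Qa = np.einsum('iujl,iuj->iul', dp_a, target) + np.einsum('iuj,iujl->iul', p, dg_a)
    return Qx, Qa   # estimates of Q^{x_l}_iu and Q^{a_l}_iu for the consequents in (5)
```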

Partial Derivatives of the Reward Function

The average local reward g_iuj and the average local derivatives g^{x_l}_iuj and g^{a_l}_iuj of the reward function g can be estimated by adapting the parameters ĝ_iuj, ĝ^{x_l}_iuj and ĝ^{a_l}_iuj of the following linear function to experiences in the vicinity of the center (x̃_i, ã_u, x̃_j):

    ǧ(x, a, y) := ĝ_iuj + Σ_{l∈D_X} ĝ^{x_l}_iuj (x_l − x̃_{i,l}) + Σ_{l∈D_A} ĝ^{a_l}_iuj (a_l − ã_{u,l}) + Σ_{l∈D_X} ĝ^{y_l}_iuj (y_l − x̃_{j,l}).    (21)

On the observation of a transition (x_k, a_k, x_{k+1}) with reward g_k, the parameters can be adapted by performing a gradient descent with respect to the following error measure:

    E := (1/2) ( g_k − ǧ(x_k, a_k, x_{k+1}) )².    (22)

Let

    η_{iuj,k} := µ^X_i(x_k) µ^A_u(a_k) µ^X_j(x_{k+1}) / M_{iuj,k+1}    (23)

be the stepsizes for the gradient descent, such that the stepsize for a given center is weighted by the membership of observed transitions in this center and decreases gradually. Based on (22) and (23), the following update rules can be derived (i, j ∈ I, u ∈ U):

    ĝ_{iuj,k+1} = ĝ_{iuj,k} + η_{iuj,k} ( g_k − ǧ(x_k, a_k, x_{k+1}) ),    (24)
    ĝ^{x_l}_{iuj,k+1} = ĝ^{x_l}_{iuj,k} + η_{iuj,k} ( x_{k,l} − x̃_{i,l} ) ( g_k − ǧ(x_k, a_k, x_{k+1}) ),   l ∈ D_X,    (25)
    ĝ^{a_l}_{iuj,k+1} = ĝ^{a_l}_{iuj,k} + η_{iuj,k} ( a_{k,l} − ã_{u,l} ) ( g_k − ǧ(x_k, a_k, x_{k+1}) ),   l ∈ D_A,    (26)
    ĝ^{y_l}_{iuj,k+1} = ĝ^{y_l}_{iuj,k} + η_{iuj,k} ( x_{k+1,l} − x̃_{j,l} ) ( g_k − ǧ(x_k, a_k, x_{k+1}) ),   l ∈ D_X.    (27)

Note that an alternative update rule for ĝ_iuj was already given in (11).

Partial Derivatives of the Conditional Probability Density Function

The average partial derivatives of the conditional probability density function can be approximated as follows:

    p^{x_l}_ij(u) ≈ ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) [ p(y; x + ε e^{N_X}_l, a) − p(y; x − ε e^{N_X}_l, a) ] / (2ε) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (28)

    p^{a_l}_ij(u) ≈ ∫∫∫ µ^X_i(x) µ^A_u(a) µ^X_j(y) [ p(y; x, a + ε e^{N_A}_l) − p(y; x, a − ε e^{N_A}_l) ] / (2ε) dy da dx / ∫∫ µ^X_i(x) µ^A_u(a) da dx,    (29)

where e^d_l is a vector of dimension d with components e^d_{l,i} = δ_il, i = 1, ..., d, δ is the Kronecker symbol and ε is a small constant. Let L^{x_l,+}_iu count the number of executions of action A_u in a fuzzy state that results from shifting state X_i along dimension l by +ε, and let M^{x_l,+}_iuj count the number of times that action A_u caused a transition from this state to X_j. Likewise, let L^{x_l,−}_iu be a counter for the number of executions of action A_u in a state that results from shifting state X_i along dimension l by −ε, and let M^{x_l,−}_iuj count the number of times that A_u caused a transition from this state to X_j. On the observation of a transition (x_k, a_k, x_{k+1}, g_k), these counters can be updated as follows (i ∈ I, u ∈ U):

    L^{x_l,+}_{iu,k+1} := L^{x_l,+}_{iu,k} + µ^X_i(x_k − ε e^{N_X}_l) µ^A_u(a_k),    (30)
    M^{x_l,+}_{iuj,k+1} := M^{x_l,+}_{iuj,k} + µ^X_i(x_k − ε e^{N_X}_l) µ^A_u(a_k) µ^X_j(x_{k+1}),   j ∈ I,    (31)
    L^{x_l,−}_{iu,k+1} := L^{x_l,−}_{iu,k} + µ^X_i(x_k + ε e^{N_X}_l) µ^A_u(a_k),    (32)
    M^{x_l,−}_{iuj,k+1} := M^{x_l,−}_{iuj,k} + µ^X_i(x_k + ε e^{N_X}_l) µ^A_u(a_k) µ^X_j(x_{k+1}),   j ∈ I.    (33)

In a similar way, counters L^{a_l,+}_iu, M^{a_l,+}_iuj, L^{a_l,−}_iu and M^{a_l,−}_iuj with the following update rules can be defined (i ∈ I, u ∈ U):

    L^{a_l,+}_{iu,k+1} := L^{a_l,+}_{iu,k} + µ^X_i(x_k) µ^A_u(a_k − ε e^{N_A}_l),    (34)
    M^{a_l,+}_{iuj,k+1} := M^{a_l,+}_{iuj,k} + µ^X_i(x_k) µ^A_u(a_k − ε e^{N_A}_l) µ^X_j(x_{k+1}),   j ∈ I,    (35)
    L^{a_l,−}_{iu,k+1} := L^{a_l,−}_{iu,k} + µ^X_i(x_k) µ^A_u(a_k + ε e^{N_A}_l),    (36)
    M^{a_l,−}_{iuj,k+1} := M^{a_l,−}_{iuj,k} + µ^X_i(x_k) µ^A_u(a_k + ε e^{N_A}_l) µ^X_j(x_{k+1}),   j ∈ I.    (37)

Then, the average partial derivatives (28) and (29) can be estimated as follows (i, j ∈ I, u ∈ U):

    p̂^{x_l}_{ij,k+1}(u) := (1/(2ε)) [ M^{x_l,+}_{iuj,k+1} / L^{x_l,+}_{iu,k+1} − M^{x_l,−}_{iuj,k+1} / L^{x_l,−}_{iu,k+1} ],    (38)
    p̂^{a_l}_{ij,k+1}(u) := (1/(2ε)) [ M^{a_l,+}_{iuj,k+1} / L^{a_l,+}_{iu,k+1} − M^{a_l,−}_{iuj,k+1} / L^{a_l,−}_{iu,k+1} ].    (39)
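The counters (30)-(37) are the same fuzzy counts as in (6)-(7), accumulated once for the partition shifted by +ε and once for the partition shifted by −ε along the dimension of interest; (38)-(39) then form a central difference of the two resulting transition models. A sketch for the state derivative along one dimension (Python, hypothetical class; one instance would be kept per dimension l) is:

```python
# Sketch of the finite-difference estimator (30)-(33), (38) for p^{x_l}_ij(u)
# along a single state dimension l; the action derivatives (34)-(37), (39)
# are obtained analogously by shifting the action instead of the state.
import numpy as np

class TransitionDerivativeEstimator:
    def __init__(self, mu_X, mu_A, n_state_sets, n_action_sets, dim_l, eps=0.05):
        self.mu_X, self.mu_A, self.dim_l, self.eps = mu_X, mu_A, dim_l, eps
        self.L_plus = np.zeros((n_state_sets, n_action_sets))                  # eq. (30)
        self.L_minus = np.zeros_like(self.L_plus)                              # eq. (32)
        self.M_plus = np.zeros((n_state_sets, n_action_sets, n_state_sets))    # eq. (31)
        self.M_minus = np.zeros_like(self.M_plus)                              # eq. (33)

    def observe(self, x, a, x_next):
        x = np.asarray(x, dtype=float)
        shift = np.zeros_like(x)
        shift[self.dim_l] = self.eps
        wa, wy = self.mu_A(a), self.mu_X(x_next)
        w_plus, w_minus = self.mu_X(x - shift), self.mu_X(x + shift)   # shifted partitions
        self.L_plus += np.outer(w_plus, wa)
        self.M_plus += w_plus[:, None, None] * wa[None, :, None] * wy[None, None, :]
        self.L_minus += np.outer(w_minus, wa)
        self.M_minus += w_minus[:, None, None] * wa[None, :, None] * wy[None, None, :]

    def estimate(self):
        """Eq. (38): central difference of the two shifted transition models."""
        def ratio(M, L):
            L = L[:, :, None]
            return np.divide(M, L, out=np.zeros_like(M), where=L > 0)
        return (ratio(self.M_plus, self.L_plus)
                - ratio(self.M_minus, self.L_minus)) / (2 * self.eps)
```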

OPTIMAL SELECTION OF FRAMEWORK SIGNAL PLANS

[Figure 1: Example framework signal plan (left: request, cycle time and extension intervals) and test scenario (right: network with residential area, shopping center north, shopping center south with cinema, industrial area, and intersections A-D).]

Framework signal plans define constraints on signal control strategies in traffic networks. A framework signal plan usually comprises individual signal plans for all traffic signals controlled by the framework signal plan. In the left part of figure 1 an example signal plan is depicted. Green phases of the traffic signal controlled according to this signal plan have to start within the "request" interval and have to end within the "extension" interval. Within the leeway given by signal plans, traffic-dependent optimization may be performed or public transportation may be prioritized.

Sophisticated traffic control systems are able to choose between different framework signal plans in dependence of traffic conditions. The rules controlling this selection are usually tuned by hand, which is not trivial in complex traffic networks. The task of selecting framework signal plans in dependence of traffic conditions, however, can be considered as a Markov decision problem, where the state is composed of measurements made on the traffic network and the framework signal plans are the available actions.

In the following, the scenario shown in the right part of figure 1 will be considered. The traffic density is measured at the three points indicated by arrows. It is assumed that three framework signal plans are given. Plan 1 favors horizontal traffic streams and should therefore be used in the morning when people go to work. In Plan 2, horizontal and vertical phases have the same length, such that this plan is suitable at noon and in the afternoon when people go shopping and return from work. The third plan finally favors traffic flows between the residential area and the cinema and should therefore be selected in the evening. During learning the controller gets the following rewards:

    g := − Σ_l ( ρ_l / ρ_{l,max} )²,    (40)

where ρ_l and ρ_{l,max} give the average and maximum density, respectively, of vehicles in link l. The basic idea behind this definition is that the average density in the road network is to be minimized, where homogeneous states in which all roads have a similar density result in larger rewards than inhomogeneous states.
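As a small illustration (not from the paper; the sign follows the reconstruction of (40) above), the reward can be computed directly from the measured link densities:

```python
# Sketch of the reward (40): the controller is penalized by the sum of squared
# relative link densities, so for a given total density a homogeneous
# distribution over the links yields the larger (less negative) reward.
def traffic_reward(densities, max_densities):
    """densities / max_densities: average and maximum vehicle density per link."""
    return -sum((rho / rho_max) ** 2 for rho, rho_max in zip(densities, max_densities))

# e.g. three links at 30%, 50% and 80% of their maximum density
print(traffic_reward([0.3, 0.5, 0.8], [1.0, 1.0, 1.0]))   # -0.98
```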

[Figure 2: Partitions of the sensor signals (normalized density ρ/ρ_max, fuzzy sets is_vs, is_s, is_m, is_h, is_vh) for the PS approach (left, crisp) and the F-PS approach (right, fuzzy).]

Two algorithms were applied to this Markov decision problem: training with prioritized sweeping (Moore and Atkeson, 1993), where the state space was discretized by the crisp partition shown in the left part of figure 2 (PS), and training with the fuzzy prioritized sweeping approach proposed in this article (F-PS), where the fuzzy partition shown in the right part of figure 2 was used. The progress of these algorithms is shown in figure 3. For the plot, training was interrupted every two simulated days and the strategy learned until then was applied to the network for one further simulated day. The total rewards gained in the course of these evaluation days are shown in figure 3, where averages over 10 runs are plotted in order to reduce statistical effects. The learning task, obviously, is solved much faster by the fuzzy model-based approach than by the crisp approach. Moreover, the strategy learned by F-PS is superior to the strategy learned by PS, i.e. the continuous Q-function obviously cannot be approximated sufficiently well by an architecture based on the crisp partition shown in figure 2.

[Figure 3: Progress of framework signal plan selection with prioritized sweeping (PS) and fuzzy prioritized sweeping (F-PS): total average density per day over the number of simulated days.]

CONCLUSIONS

In this article a novel fuzzy model-based reinforcement learning approach was presented. The approach represents continuous Q-functions by Takagi-Sugeno models with linear consequents. As Q-functions directly represent control knowledge, control strategies learned by the F-PS approach can be expected to be superior to strategies learned by methods based on crisp partitions. The proposed method was applied to the task of selecting optimal framework signal plans in dependence of traffic conditions. As expected, the proposed method outperforms the crisp PS approach when used with partitions of similar granularity. In the example of application presented in this article, the actions were discrete. The proposed algorithm, however, also performs well in environments with continuous action spaces, as can easily be verified on small toy examples. Real-world problems with continuous action spaces will be considered in future publications.

REFERENCES

Appl, M.; Palm, R., 1999, Fuzzy Q-learning in nonstationary environments, Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing.

Bertsekas, D. P.; Tsitsiklis, J. N., 1996, Neuro-Dynamic Programming, Athena Scientific.

Bingham, E., 1998, Neurofuzzy traffic signal control, Master's thesis, Helsinki University of Technology.

Davies, S., 1997, Multidimensional triangulation and interpolation for reinforcement learning, Advances in Neural Information Processing Systems, Volume 9, The MIT Press.

Moore, A. W.; Atkeson, C. G., 1993, Memory-based reinforcement learning: Converging with less data and less time, Robot Learning.

Sugeno, M., 1985, An introductory survey of fuzzy control, Information Sciences 36.

Sutton, R. S.; Barto, A. G., 1998, Reinforcement Learning: An Introduction, The MIT Press.

Takagi, T.; Sugeno, M., 1985, Fuzzy identification of systems and its application to modeling and control, IEEE Transactions on Systems, Man and Cybernetics, Volume 15.

Thorpe, T., 1997, Vehicle Traffic Light Control Using SARSA, Ph.D. thesis, Department of Computer Science, Colorado State University.

Watkins, C. J. C. H., 1989, Learning from Delayed Rewards, Ph.D. thesis, Cambridge University.
