Multi-robot Formation Control Using Reinforcement Learning Method

Guoyu Zuo, Jiatong Han, and Guansheng Han

School of Electronic Information & Control Engineering, Beijing University of Technology, Beijing 100124, China
hjt925@163.com

Abstract. Formation is a good example of the research on multi-robot cooperation. Many different ways can be used to accomplish this task, but the main drawback of most of these methods is that the robots cannot self-learn. Following Brooks' behavioral view, this paper verifies that the reinforcement learning method can be used by robots to select different behaviors in various situations. Experiments are performed to illustrate the team robots' capability of self-learning and autonomy. The results show that the robots can achieve self-formation in a barrier environment after learning.

Keywords: Multi-robot, Formation control, Reinforcement learning.

1 Introduction

There are different levels of needs in maintaining the robot formation's completeness under various mission requirements. To reach the goal of the robot formation moving to a specified point, motion control should take into account the requirements of formation, such as avoiding obstacles and other factors. In this paper we mainly use Brooks' behavioral inhibition method, in which, at every moment, the formation task is specified as a series of actions, defined here as act 1, act 2, ..., act n. Using these defined behaviors, the multi-robot system can perform self-formation in an obstacle environment. Each robot's control system has a reinforcement learning module to achieve the upper-level behavior control. Our aim is that, through learning, the multi-robot system can autonomously construct good mappings from environment to behavior. This mapping relation lets the robots choose appropriate behaviors at each step so as to accomplish the formation task without colliding with the obstacles.
2 The Leader-Follower Control Method

The Leader-Follower control method is often used in multi-robot formation control: the team leader decides where the formation moves while avoiding obstacles, and the other robots, i.e., the followers, follow the leader with a certain speed and angular velocity [1, 2].

Y. Tan, Y. Shi, and K.C. Tan (Eds.): ICSI 2010, Part I, LNCS 6145, pp. 667-674, 2010. © Springer-Verlag Berlin Heidelberg 2010
Figure 1 shows the $l$-$\varphi$ control method. The follower always keeps following its leader, ensuring that $l_2 \to l_2^d$ and $\varphi_2 \to \varphi_2^d$, where $l_2$ is the distance between the two robots, $\varphi_2$ is the angle between them, and $d$ is the distance between the centre of the robot's wheel axle and the robot's rotation centre.

Fig. 1. Model of $l$-$\varphi$ control (showing $\theta_1$, $\theta_2$, $l_2$, $\varphi_2$)

The kinematic equations of the leader robot are:

$$\dot{x}_i = v_i \cos\theta_i, \quad \dot{y}_i = v_i \sin\theta_i, \quad \dot{\theta}_i = \omega_i \qquad (1)$$

where $v_i$ and $\omega_i$, $i = 1, 2$, represent the linear speed and angular velocity of the two robots, respectively. The kinematic equations of the follower robot are shown as follows:

$$\dot{l}_2 = v_2 \cos\gamma_1 - v_1 \cos\varphi_2 + d\,\omega_2 \sin\gamma_1$$
$$\dot{\varphi}_2 = \frac{1}{l_2}\left\{ v_1 \sin\varphi_2 - v_2 \sin\gamma_1 + d\,\omega_2 \cos\gamma_1 - l_2\,\omega_1 \right\}$$
$$\dot{\theta}_2 = \omega_2, \qquad \gamma_1 = \theta_1 + \varphi_2 - \theta_2 \qquad (2)$$

For the follower, the control output $(\omega_2, v_2)$ is:

$$\omega_2 = \frac{\cos\gamma_1}{d}\left\{ a_2 l_2 (\varphi_2^d - \varphi_2) - v_1 \sin\varphi_2 + l_2\,\omega_1 + \rho_2 \sin\gamma_1 \right\}$$
$$v_2 = \rho_2 - d\,\omega_2 \tan\gamma_1 \qquad (3)$$
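The $l$-$\varphi$ controller above can be checked numerically. The following is a minimal simulation sketch: the gains $a_1, a_2$, the offset $d$, the leader's trajectory, and the initial/desired relative states are all assumed values not given in the paper. It Euler-integrates the follower's relative kinematics of Eq. (2) under the control law of Eqs. (3)-(4); by the closed loop of Eq. (5), $l_2$ and $\varphi_2$ converge exponentially to the desired values.

```python
import math

D = 0.2  # offset d between wheel-axle centre and reference point (assumed value)

def follower_control(l2, phi2, theta1, theta2, v1, w1, l2_d, phi2_d,
                     a1=1.0, a2=1.0):
    """Follower commands (v2, w2) from Eqs. (3)-(4)."""
    gamma1 = theta1 + phi2 - theta2                                    # Eq. (2)
    rho2 = (a1 * (l2_d - l2) + v1 * math.cos(phi2)) / math.cos(gamma1)  # Eq. (4)
    w2 = (math.cos(gamma1) / D) * (a2 * l2 * (phi2_d - phi2)
          - v1 * math.sin(phi2) + l2 * w1 + rho2 * math.sin(gamma1))    # Eq. (3)
    v2 = rho2 - D * w2 * math.tan(gamma1)
    return v2, w2

def simulate(steps=4000, dt=0.005):
    """Euler-integrate the relative kinematics of Eq. (2); by Eq. (5)
    l2 and phi2 converge to the desired l2_d, phi2_d."""
    l2, phi2 = 1.2, math.radians(170.0)      # initial relative state (assumed)
    theta1, theta2 = 0.0, 0.0
    v1, w1 = 0.3, 0.0                        # leader drives straight
    l2_d, phi2_d = 1.0, math.radians(180.0)  # desired: follower directly behind
    for _ in range(steps):
        v2, w2 = follower_control(l2, phi2, theta1, theta2, v1, w1, l2_d, phi2_d)
        gamma1 = theta1 + phi2 - theta2
        # follower relative kinematics, Eq. (2)
        l2_dot = v2 * math.cos(gamma1) - v1 * math.cos(phi2) + D * w2 * math.sin(gamma1)
        phi2_dot = (v1 * math.sin(phi2) - v2 * math.sin(gamma1)
                    + D * w2 * math.cos(gamma1) - l2 * w1) / l2
        l2 += l2_dot * dt
        phi2 += phi2_dot * dt
        theta1 += w1 * dt
        theta2 += w2 * dt
    return l2, phi2
```

Substituting Eq. (3) into Eq. (2) reproduces Eq. (5) term by term, which is why the simulated $l_2$ and $\varphi_2$ settle at their desired values regardless of the (reasonable) initial state chosen here.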
here,

$$\rho_2 = \frac{a_1 (l_2^d - l_2) + v_1 \cos\varphi_2}{\cos\gamma_1} \qquad (4)$$

So the $l$-$\varphi$ closed-loop control is expressed as:

$$\dot{l}_2 = a_1 (l_2^d - l_2), \qquad \dot{\varphi}_2 = a_2 (\varphi_2^d - \varphi_2) \qquad (5)$$

where $a_1$ and $a_2$ are the coefficients of the proportional control method.

3 Behavioral Design of Multi-robot Formation Using Reinforcement Learning

If there are obstacles in the environment, the robots cannot pass through the barriers while maintaining the original formation. In general, they need to change their formation into different types. When passing through a narrow passage between obstacles, the formation usually converts into a line formation. That is, the followers need to change the following angle with the leader in order to pass through the obstacles. After the followers move through the obstacles and come to a spacious environment, they change the angle again to return to the original formation. Here we use a reinforcement learning method to design the behaviors of the followers.

Q-learning is a reinforcement learning algorithm. Its main idea is not to learn an evaluation function of each state, but to learn an evaluation function Q(s, a) of each state-action pair. Q(s, a) signifies the cumulative discounted value obtained after performing action a in state s. When the environment is unknown or changes dynamically, Q-learning has a great advantage: it does not need to know the next state resulting from an action, but only relies on the current state-action pairs' Q values to determine the optimal strategy in state s [3, 4].

$$Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a) \qquad (6)$$

Equation (6) holds only under the optimal strategy.
In the reinforcement learning process, the two ends of equation (6) are not strictly equal, and the error is expressed as:

$$\Delta Q_t(s_t, a_t) = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \qquad (7)$$

$$\begin{aligned}
Q_{t+1}(s_t, a_t) &= Q_t(s_t, a_t) + \alpha_t\,\Delta Q_t(s_t, a_t) \\
&= Q_t(s_t, a_t) + \alpha_t \left[ r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right] \\
&= (1 - \alpha_t)\,Q_t(s_t, a_t) + \alpha_t \left[ r_t + \gamma \max_a Q_t(s_{t+1}, a) \right] \\
&= (1 - \alpha_t)\,Q_t(s_t, a_t) + \alpha_t \left[ r_t + \gamma V_t(s_{t+1}) \right]
\end{aligned} \qquad (8)$$
where $V_t(s_{t+1}) = \max_a Q_t(s_{t+1}, a)$,
$\alpha_t$: learning rate at moment $t$,
$\gamma$: discount factor for $V(s_{t+1})$,
$r_t$: the reinforcement signal at moment $t$ evaluating action $a_t$.

The interactions between the robots and the environment are as follows:
1) The robot perceives the current environment state $s_t$;
2) Obtain a timely reward for the current state of the environment, and choose the appropriate action according to the selected strategy;
3) Perform the selected action, upon which the environment changes;
4) Get the next state of the environment $s_{t+1}$;
5) Calculate the timely reward $r_t$;
6) $t \leftarrow t + 1$; move to step 7 if the learning objectives have been achieved, otherwise return to step 2;
7) End of the learning process.

The reinforcement learning process is as follows. Each robot gets information about the environment through its own sonar and through the information coming from the other robots. This sensor information is sent to the reinforcement learning module (Q-learning). Based on all of the robot's sensor signals, the reinforcement learning module decides which of the actions act 1, act 2, ..., act n to take, and the robots execute the chosen actions in the environment. The environment gives every action a different reinforcement signal value based on the effect of the behavior. The system thereby decides the trend of such actions, i.e., whether they will be strengthened or weakened in such an environment, and eventually it learns to take appropriate actions in different circumstances to achieve self-formation in the obstacle space [5, 6]. Based on the current state, reinforcement learning acts with a degree of randomness; for action selection, we use the Boltzmann distribution to realize the random choice of actions.

4 Experiment and Analysis

In the following experiments, two Pioneer3-DX mobile robots are used as the experimental platform to research multi-robot formation.
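The interaction loop of steps 1)-7) with Boltzmann action selection can be sketched as follows. This is a minimal, self-contained illustration: the paper's real environment is the robots' sonar readings, so the reward here is a toy stand-in (reward the line-formation action when close to an obstacle, the keep-formation action otherwise), and the learning parameters are assumed values.

```python
import math
import random

ACTIONS = [0, 1]   # 0: keep formation (Action1), 1: line formation (Action2)
STATES = [0, 1, 2]  # distance states: Small, Middle, Large (Table 1)

def boltzmann_choice(q_row, temperature):
    """Pick an action with probability proportional to exp(Q/T)."""
    q_max = max(q_row)  # subtract the max to avoid overflow in exp
    weights = [math.exp((q - q_max) / temperature) for q in q_row]
    x, acc = random.random() * sum(weights), 0.0
    for a, w in enumerate(weights):
        acc += w
        if x <= acc:
            return a
    return len(weights) - 1

def q_learning(episodes=500, alpha=0.3, gamma=0.9, t0=1.0, beta=1.01):
    """Steps 1)-7): perceive state, pick an action by the Boltzmann
    distribution, observe reward and next state, update Q by Eq. (8),
    and anneal the temperature T over time."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    temperature = t0
    for t in range(episodes):
        s = random.choice(STATES)                  # 1) perceive state
        a = boltzmann_choice(q[s], temperature)    # 2) select action
        r = 1.0 if (s == 0) == (a == 1) else -1.0  # toy reward (stand-in)
        s_next = random.choice(STATES)             # 4) next state
        # Eq. (8): Q <- (1 - alpha) Q + alpha (r + gamma * max_a' Q(s', a'))
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * max(q[s_next]))
        temperature = max(t0 / beta ** t, 0.05)    # annealing, floored
    return q
```

After a few hundred iterations the Q-table prefers the line formation in the Small-distance state and the original formation in the Middle/Large states, mirroring the behavior selection described above.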
The robot is equipped with 16 sonar sensors, which cover the 0°~360° range around the robot.

4.1 Robot's State Space

The state space of the robot is expressed as: $s = \{l, f, r\}$
$l$: the distance between the left side of the robot and the obstacle,
$f$: the distance between the front side of the robot and the obstacle,
$r$: the distance between the right side of the robot and the obstacle.

The above three parameters of the state space are shown in Figure 2.

Fig. 2. Position relationships between the robot, the obstacle and the goal

As the maximum distance detected by the sonar sensors is 5000 mm and the minimum effective distance is 100 mm, we have 100 < f, r, l < 5000 mm; that is, we do not consider distances to obstacles of more than 5000 mm or less than 100 mm. The following formula is used to calculate the distance-weighted average:

$$\mathit{range} = 0.5 f + 0.25 r + 0.25 l \qquad (9)$$

The distances between the robots and the obstacles are divided into three discrete states as shown in Table 1:

Table 1. Division of robot states (mm)

State:  Small            | Middle              | Large
range:  0 < range < 500  | 500 < range < 2000  | range > 2000

Table 2. Q-table of state-action pairs

                   Small ($s_1$)     Middle ($s_2$)    Large ($s_3$)
Action1 ($a_1$)    $Q(s_1, a_1)$     $Q(s_2, a_1)$     $Q(s_3, a_1)$
Action2 ($a_2$)    $Q(s_1, a_2)$     $Q(s_2, a_2)$     $Q(s_3, a_2)$

At the same time, we define two kinds of robot behaviors:
Action1: keeping the original formation (keep the angle between leader and follower at $\theta = 0°$);
Action2: transforming the original formation into a line formation (adjust the angle between leader and follower to $\theta = 180°$).

The state-action pairs are shown in Table 2.

4.2 Reinforcement Signals

The selection of the reinforcement signal is very important for the reinforcement learning method. Here we use both internal and external reinforcement signals, to reflect the individual's interests and the interests of the whole, respectively.

1) Internal reinforcement signal: the internal reinforcement signal is used to evaluate the individual interests of the robots, defined by the distance between the robot and the obstacles. $l_{\min}$ is the minimum distance: when $l < l_{\min}$, we consider that the robot has got into a dangerous place, so we give it a punishment. $l_{\max}$ is considered a safe distance: when $l > l_{\max}$, the robot is relatively safe, and we give it a reward.

$$r_{in} = \begin{cases} -1 & l < l_{\min} \\ f(l) & l_{\min} < l < l_{\max} \\ 1 & l > l_{\max} \end{cases} \qquad (10)$$

where $f(l)$ is a linear function defined as:

$$f(l) = \frac{2\,(l - l_{\min})}{l_{\max} - l_{\min}} - 1 \qquad (11)$$

2) External reinforcement signal: this signal is used to regulate the group action for the overall interests. The group actions planned by the individual robots in each step of reinforcement learning may not be consistent. In order to make each follower robot keep the same behavior as the leader, an election approach is adopted which takes the behavior expected by most of the robots as the whole team's action, and then each robot implements the overall team behavior. Here we define the external reinforcement signal for each robot as follows: if the robot's behavior is consistent with the team behavior, the robot is rewarded for this behavior; otherwise, it is punished. The external reinforcement signal is denoted as:

$$r_{out} = \begin{cases} 1 & \text{the robot's behavior is consistent with the team's final behavior} \\ -1 & \text{otherwise} \end{cases} \qquad (12)$$

The overall reinforcement signal is expressed as the weighted sum of the internal and external reinforcement signals:

$$r = \alpha\,r_{in} + \beta\,r_{out} \qquad (13)$$

where $\alpha + \beta = 1$.
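The reward computation of Eqs. (10)-(13) can be sketched directly. This is a hedged illustration: the paper does not give numeric values for $l_{\min}$, $l_{\max}$, $\alpha$, or $\beta$, so the values below are assumptions, and the linear ramp follows the reconstruction of Eq. (11) above.

```python
def internal_reward(l, l_min=100.0, l_max=2000.0):
    """Eqs. (10)-(11): individual (obstacle-distance) reward in [-1, 1].
    l_min / l_max are assumed thresholds in mm."""
    if l < l_min:
        return -1.0   # dangerously close: punishment
    if l > l_max:
        return 1.0    # safe distance: reward
    # linear ramp f(l) from -1 at l_min to +1 at l_max
    return 2.0 * (l - l_min) / (l_max - l_min) - 1.0

def team_action_by_vote(actions):
    """Election approach: the behavior expected by most robots
    becomes the whole team's action."""
    return max(set(actions), key=actions.count)

def external_reward(own_action, team_action):
    """Eq. (12): +1 if the robot agrees with the elected team action."""
    return 1.0 if own_action == team_action else -1.0

def overall_reward(l, own_action, actions, alpha=0.6, beta=0.4):
    """Eq. (13): weighted sum with alpha + beta = 1 (weights assumed)."""
    team = team_action_by_vote(actions)
    return alpha * internal_reward(l) + beta * external_reward(own_action, team)
```

For example, a robot at a safe distance whose chosen action matches the majority vote receives the maximum reward $\alpha + \beta = 1$, while disagreeing with the team pulls the total down even when the robot itself is safe.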
The Q function is defined as:

$$Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a) \qquad (14)$$

The update rules are:

$$\Delta Q_t(s_t, a_t) = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \qquad (15)$$

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t\,\Delta Q_t(s_t, a_t) \qquad (16)$$

At the beginning of reinforcement learning, the main task is to explore the environment, so the randomness of action selection should be greater; in the later stage, learning should converge, so the randomness of action selection should be smaller. A Boltzmann machine is used here to carry out the annealing operation. The probability of choosing action $a_i$ is:

$$P(a_i) = \frac{e^{Q(s, a_i)/T}}{\sum_{a_n \in A} e^{Q(s, a_n)/T}}, \qquad T = T_0 / \beta^t \qquad (17)$$

$T_0$: the initial temperature value,
$t$: time,
$T$: the current temperature value, obtained from $T_0$'s decay with time,
$\beta$: a constant used to control the rate of annealing.

The ε-greedy strategy is used to select action $a_i$: in each action selection, with probability ε an action is selected randomly, and with probability 1−ε the action with the largest Q value is selected [7].

Figure 3 shows the formation walking conditions after learning.

Fig. 3. The robots (a) maintain the formation in a spacious environment, (b) get into line formation when encountering obstacles, and (c) get back to the original formation after moving through the obstacles

In the experiment, the leader plans out a path from the beginning to the end. The follower changes its angle with the leader in accordance with the changes of
the environment and is gradually adapted to the new environment. The robots maintain the original formation in a spacious environment. As they come closer to the obstacle, they get into a line formation to pass through the obstacles. When the environment becomes spacious again, the robots gradually adjust back to the original formation of the first stage.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 60873043) and the Funding Program for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality of China (No. PHR20008006).

References

1. Shao, J.Y., Xie, G.M., Yu, J.Z., Wang, L.: Leader-following Formation Control of Multiple Mobile Robots. In: Proceedings of the 2005 IEEE International Symposium on Intelligent Control, Limassol, June 27-29, pp. 808-813 (2005)
2. Desai, J.P., Ostrowski, J.P., Kumar, V.: Modeling and Control of Formations of Nonholonomic Mobile Robots. IEEE Transactions on Robotics and Automation 17(6), 905-908 (2001)
3. Ruan, X.G., Cai, J.X., Chen, J.: Learning to Control a Two-Wheeled Self-Balancing Robot Using Reinforcement Learning Rules and a Fuzzy Neural Network. In: Fourth International Conference on Natural Computation, pp. 395-398 (2008)
4. Ishikawa, M., Hagiwara, T., Yamamoto, N., Kiriake, F.: Brain-Inspired Emergence of Behaviors in Mobile Robots by Reinforcement Learning with Internal Rewards. In: Eighth International Conference on Hybrid Intelligent Systems, pp. 38-43 (2008)
5. Dung, L.T., Komeda, T., Takagi, M.: Reinforcement Learning in Non-Markovian Environments Using Automatic Discovery of Subgoals. In: SICE Annual Conference, pp. 2600-2605 (2007)
6. Huang, B.Q.: Reinforcement Learning Method and Applied Research. PhD thesis, Shanghai Jiaotong University, pp. 22-23 (2007)
7. Gao, Y., Chen, S.F., Lu, X.: Research on Reinforcement Learning. Acta Automatica Sinica 30(1), 86-100 (2004)