Multi-robot Formation Control Using Reinforcement Learning Method


Guoyu Zuo, Jiatong Han, and Guansheng Han

School of Electronic Information & Control Engineering, Beijing University of Technology, Beijing 100124, China
hjt925@163.com

Y. Tan, Y. Shi, and K.C. Tan (Eds.): ICSI 2010, Part I, LNCS 6145, pp. 667-674, 2010. © Springer-Verlag Berlin Heidelberg 2010

Abstract. Formation is a good example of research on multi-robot cooperation. Many different methods can be used to accomplish this task, but the main drawback of most of them is that the robots cannot learn by themselves. Following Brooks' behavioral view, this paper verifies that the reinforcement learning method can be used by robots to select different behaviors in different situations. Experiments are performed to illustrate the team robots' capability of self-learning and autonomy. The results show that the robots can achieve self-formation in a barrier environment after learning.

Keywords: Multi-robot, Formation control, Reinforcement Learning.

1 Introduction

There are different levels of need for maintaining the completeness of a robot formation under various mission requirements. To achieve the goal of moving the robot formation to a specified point, motion control should take into account the requirements of the formation, such as avoiding obstacles and other factors. In this paper we mainly use Brooks' behavioral inhibition method, in which, at every moment, the formation task is specified as a series of actions, defined here as act_1, act_2, ..., act_n. Using these defined behaviors, the multi-robot system can perform self-formation in an obstacle environment. Each robot's control system has a reinforcement learning module that realizes the upper-level behavior control. Our aim is that, through learning, the multi-robot system can autonomously construct good mappings from environment to behavior. This mapping relation lets the robots choose appropriate behaviors at each step, so that the formation task is accomplished without colliding with the obstacles.

2 The Leader-Follower Control Method

The Leader-Follower control method is often used in multi-robot formation control: the team leader decides where the formation moves while avoiding obstacles, and the other robots, i.e., the followers, follow the leader with certain linear and angular velocities [1, 2].

Figure 1 shows the l-φ control method. The follower always maintains its position relative to the leader, ensuring that $l_{12} \to l_{12}^d$ and $\varphi_{12} \to \varphi_{12}^d$, where $l_{12}$ is the distance between the two robots, $\varphi_{12}$ is the relative bearing angle between them, and $d$ is the distance between the centre of the robot's wheel axle and an off-axis reference point on the robot.

Fig. 1. Model of the l-φ control (showing $\theta_1$, $\theta_2$, $l_{12}$, $\varphi_{12}$ and $d$)

The kinematic equations of the leader robot are:

$\dot{x}_i = v_i \cos\theta_i, \quad \dot{y}_i = v_i \sin\theta_i, \quad \dot{\theta}_i = \omega_i$    (1)

where $v_i$ and $\omega_i$, $i = 1, 2$, represent the linear and angular velocities of the two robots, respectively. The kinematic equations of the follower robot are:

$\dot{l}_{12} = v_2 \cos\gamma_1 - v_1 \cos\varphi_{12} + d\,\omega_2 \sin\gamma_1$
$\dot{\varphi}_{12} = \frac{1}{l_{12}} \{ v_1 \sin\varphi_{12} - v_2 \sin\gamma_1 + d\,\omega_2 \cos\gamma_1 - l_{12}\,\omega_1 \}$
$\dot{\theta}_2 = \omega_2$
$\gamma_1 = \theta_1 + \varphi_{12} - \theta_2$    (2)

For the follower, the control output $(\omega_2, v_2)$ is:

$\omega_2 = \frac{\cos\gamma_1}{d} \{ a_2 l_{12} (\varphi_{12}^d - \varphi_{12}) - v_1 \sin\varphi_{12} + l_{12}\,\omega_1 + \rho_{12} \sin\gamma_1 \}$
$v_2 = \rho_{12} - d\,\omega_2 \tan\gamma_1$    (3)

where

$\rho_{12} = \frac{a_1 (l_{12}^d - l_{12})}{\cos\gamma_1} + \frac{v_1 \cos\varphi_{12}}{\cos\gamma_1}$    (4)

So the l-φ closed-loop control is expressed as:

$\dot{l}_{12} = a_1 (l_{12}^d - l_{12})$
$\dot{\varphi}_{12} = a_2 (\varphi_{12}^d - \varphi_{12})$    (5)

where $a_1$ and $a_2$ are the coefficients of the proportional control method. A minimal code sketch of this control law is given below.
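For concreteness, the following Python sketch computes one control step of the follower commands from Eqs. (3)-(4) as reconstructed above. The function name, the default gains a1 and a2, the offset d and the example numbers are illustrative assumptions, not values given in the paper.

```python
import math

def follower_control(l12, phi12, theta1, theta2, v1, w1,
                     l12_d, phi12_d, a1=1.0, a2=1.0, d=0.1):
    """One step of the l-phi follower control law, Eqs. (3)-(4).

    l12, phi12     : current separation and relative bearing to the leader
    theta1, theta2 : leader and follower headings
    v1, w1         : leader linear and angular velocities
    l12_d, phi12_d : desired separation and bearing
    a1, a2         : proportional gains; d: off-axis reference offset
    Returns (v2, w2), the follower's velocity commands.
    """
    gamma1 = theta1 + phi12 - theta2

    # Eq. (4): auxiliary term rho_12
    rho12 = (a1 * (l12_d - l12) + v1 * math.cos(phi12)) / math.cos(gamma1)

    # Eq. (3): follower angular and linear velocity commands
    w2 = (math.cos(gamma1) / d) * (a2 * l12 * (phi12_d - phi12)
                                   - v1 * math.sin(phi12)
                                   + l12 * w1
                                   + rho12 * math.sin(gamma1))
    v2 = rho12 - d * w2 * math.tan(gamma1)
    return v2, w2

# Example call with purely illustrative values
v2, w2 = follower_control(l12=1.2, phi12=math.radians(150),
                          theta1=0.0, theta2=0.1, v1=0.3, w1=0.0,
                          l12_d=1.0, phi12_d=math.radians(180))
```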

3 Behavioral Design of Multi-robot Formation Using Reinforcement Learning

If there are obstacles in the environment, the robots cannot pass through the barriers while maintaining the original formation. In general, they need to change the formation into a different type. When passing through a narrow obstacle, the formation usually converts into a line formation; that is, the followers need to change their following angle with the leader in order to pass through the obstacles. After the followers move through the obstacles and come to a spacious environment, they change the angle again to return to the original formation. Here we use a reinforcement learning method to design the behaviors of the followers.

Q-learning is a reinforcement learning algorithm. Its main idea is not to learn an evaluation function of each state, but to learn an evaluation function Q(s, a) of each state-action pair. Q(s, a) signifies the cumulative discounted value obtained after performing action a in state s. When the environment is unknown or changes dynamically, Q-learning has a great advantage: it does not need to know the next state in advance, but only relies on the Q values of the current state-action pairs to determine the optimal strategy in state s [3, 4].

$Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a)$    (6)

Equation (6) holds only under the optimal strategy. During the reinforcement learning process, the two sides of equation (6) are not strictly equal, and the error is expressed as:

$\Delta Q_t(s_t, a_t) = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$    (7)

$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t\, \Delta Q_t(s_t, a_t)$
$\qquad = Q_t(s_t, a_t) + \alpha_t [ r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) ]$
$\qquad = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t [ r_t + \gamma \max_a Q_t(s_{t+1}, a) ]$
$\qquad = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t [ r_t + \gamma V_t(s_{t+1}) ]$    (8)

where
$V_t(s_{t+1}) = \max_a Q_t(s_{t+1}, a)$,
$\alpha_t$: the learning rate at time t,
$\gamma$: the discount factor for $V(s_{t+1})$,
$r_t$: the reinforcement signal at time t for action $a_t$.

A minimal code sketch of this tabular update is given below.
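As a minimal illustration of Eqs. (7)-(8), the tabular update can be written as follows. Representing the Q-table as a Python dictionary keyed by (state, action) pairs, and the default values of alpha and gamma, are implementation assumptions, not something prescribed by the paper.

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update of Eqs. (7)-(8).

    Q is a dict mapping (state, action) pairs to Q values; unknown pairs
    default to 0. alpha is the learning rate, gamma the discount factor.
    """
    v_next = max(Q.get((s_next, b), 0.0) for b in actions)   # V_t(s_{t+1})
    delta = r + gamma * v_next - Q.get((s, a), 0.0)           # Eq. (7)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta            # Eq. (8)
    return Q[(s, a)]
```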

The interactions between the robots and the environment are as follows:

1) The robot perceives the current environment state $s_t$,
2) Obtain a timely reward for the current state of the environment, and select the appropriate action according to a certain strategy,
3) Perform the selected action, and the environment changes,
4) Get the next state of the environment $s_{t+1}$,
5) Calculate the reward $r_t$ in a timely manner,
6) $t \leftarrow t + 1$; move to step 7 if the learning objective has been achieved, otherwise return to step 2,
7) End of the learning process.

The reinforcement learning process is as follows: each robot gets the information of the environment through its own sonar and through the information coming from the other robots. This sensor information is sent to the reinforcement learning module (Q-learning). The reinforcement learning module decides which action to take in accordance with all of the robot's sensor signals: act_1, act_2, ..., act_n, and the robots take the appropriate actions in the environment. The robot's environment system gives every action a different reinforcement signal value based on the effect of the behavior. The system thereby decides the trend of such actions, i.e., whether they will be strengthened or weakened in such an environment. Eventually the system learns to take appropriate actions in the different circumstances to achieve self-formation in the obstacle space [5, 6].

Reinforcement learning is based on the current state and selects a behavior with some randomness. For the action selection, we use the Boltzmann distribution to realize the random choice of actions. The sketch below illustrates this interaction loop in code.
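The seven interaction steps above map onto a standard episodic loop. The sketch below shows this structure using the q_update function from the previous sketch; the robot object, select_action and reward_fn are hypothetical placeholder interfaces standing in for the robot's sensing, actuation, action-selection strategy and reinforcement-signal computation, and are not an API from the paper.

```python
def run_episode(robot, Q, actions, select_action, reward_fn, max_steps=1000):
    """Agent-environment interaction loop corresponding to steps 1)-7)."""
    s = robot.perceive_state()                    # step 1: sense the environment
    for t in range(max_steps):
        a = select_action(Q, s, actions, t)       # step 2: choose an action
        robot.execute(a)                          # step 3: act; the environment changes
        s_next = robot.perceive_state()           # step 4: observe the next state
        r = reward_fn(robot, a)                   # step 5: compute the reward r_t
        q_update(Q, s, a, r, s_next, actions)     # Q-learning update, Eq. (8)
        s = s_next
        if robot.goal_reached():                  # steps 6-7: stop when the task is done
            break
    return Q
```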

4 Experiment and Analysis

In the following experiments, two Pioneer3-DX mobile robots are used as the experimental platform to study multi-robot formation. Each robot is equipped with 16 sonar sensors, which cover the 0°-360° range around the robot.

4.1 Robot's State Space

The state space of the robot is expressed as:

$s = \{ d_l, d_f, d_r \}$

$d_l$: the distance between the left side of the robot and the obstacle,
$d_f$: the distance between the front side of the robot and the obstacle,
$d_r$: the distance between the right side of the robot and the obstacle.

The above three parameters of the state space are shown in Figure 2.

Fig. 2. Position relationships between the robot, the obstacle and the goal

As the maximum distance detected by the sonar sensors is 5000 mm and the minimum effective distance is 100 mm, we have 100 < $d_f, d_r, d_l$ < 5000 mm; that is, we do not consider distances to obstacles of more than 5000 mm or less than 100 mm. The following formula is used to calculate the distance-weighted average:

$range = 0.5\, d_f + 0.25\, d_r + 0.25\, d_l$    (9)

The distances between the robots and the obstacles are divided into three discrete states, as shown in Table 1:

Table 1. Division of robot states (mm)

State    Small              Middle               Large
range    0 < range < 500    500 < range < 2000   range > 2000

Table 2. Q-table of state-action pairs

                Small ($s_1$)     Middle ($s_2$)    Large ($s_3$)
Action 1 ($a_1$)  $Q(s_1, a_1)$     $Q(s_2, a_1)$     $Q(s_3, a_1)$
Action 2 ($a_2$)  $Q(s_1, a_2)$     $Q(s_2, a_2)$     $Q(s_3, a_2)$

A code sketch of this state discretization and Q-table is given below.
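To make the discretization concrete, the sketch below converts the three sonar distances into the weighted range of Eq. (9) and then into one of the three states of Table 1, and initialises the Q-table of Table 2. The action names and the clipping of out-of-window readings are illustrative assumptions; the thresholds and weights are taken directly from Eq. (9) and Table 1.

```python
ACTIONS = ("action1_keep_formation", "action2_line_formation")
STATES = ("small", "middle", "large")

def discretize(d_l, d_f, d_r):
    """Map the three sonar distances (in mm) to a discrete state.

    Uses the weighted average of Eq. (9) and the thresholds of Table 1.
    Readings outside the sonar's effective window are clipped to
    [100, 5000] mm here, which is one way of reading the text above.
    """
    d_l, d_f, d_r = (min(max(d, 100.0), 5000.0) for d in (d_l, d_f, d_r))
    rng = 0.5 * d_f + 0.25 * d_r + 0.25 * d_l     # Eq. (9)
    if rng < 500:
        return "small"       # s_1
    if rng < 2000:
        return "middle"      # s_2
    return "large"           # s_3

# Q-table of Table 2, initialised to zero for every state-action pair
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
```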

At the same time, we define two kinds of robot behaviors:

Action 1: keeping the original formation (keep the angle between leader and follower at θ = 0°);
Action 2: transforming the original formation into a line formation (adjust the angle between leader and follower to θ = 180°).

The state-action pairs are shown in Table 2.

4.2 Reinforcement Signals

The selection of the reinforcement signal is very important for the reinforcement learning method. Here, we use both internal and external reinforcement signals to reflect the individual robot's interests and the interests of the whole team.

1) Internal reinforcement signal: The internal reinforcement signal is used to evaluate the individual interests of the robots and is defined by the distance between the robot and the obstacles. $l_{min}$ is the minimum distance: when $l < l_{min}$, we consider that the robot has entered a dangerous place, so it is given a punishment. $l_{max}$ is considered a safe distance: when $l > l_{max}$, the robot is relatively safe, so it is given a reward.

$r_{in} = \begin{cases} -1, & l < l_{min} \\ 1, & l > l_{max} \\ f(l), & l_{min} < l < l_{max} \end{cases}$    (10)

where f(l) is a linear function defined as:

$f(l) = \frac{2(l - l_{min})}{l_{max} - l_{min}} - 1$    (11)

2) External reinforcement signal: This signal is used to regulate the group action in the overall interest. The group actions planned by the individual robots in each step of reinforcement learning may not be consistent with one another. In order to make each follower robot keep the same behavior as the leader, an election approach is adopted, which takes the behavior expected by the majority of the robots as the whole-team action; each robot then implements this overall team behavior. Here we define the external reinforcement signal for each robot as follows: if the robot's behavior is consistent with the team behavior, the robot is rewarded for this behavior; otherwise, it is punished. The external reinforcement signal is denoted as:

$r_{out} = \begin{cases} 1, & \text{the robot's behavior is consistent with the team's final behavior} \\ -1, & \text{otherwise} \end{cases}$    (12)

The overall reinforcement signal is expressed as the weighted sum of the internal and external reinforcement signals:

$r = \alpha\, r_{in} + \beta\, r_{out}$    (13)

where $\alpha + \beta = 1$. A code sketch of this reward computation is given below.
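The combined reinforcement signal of Eqs. (10)-(13) can be computed as sketched below. The linear form of f(l) follows the reconstruction in Eq. (11); the particular values of l_min, l_max and alpha are placeholders, since the paper does not state them.

```python
def internal_reward(l, l_min=300.0, l_max=1500.0):
    """Internal signal r_in of Eqs. (10)-(11); l is the obstacle distance in mm."""
    if l < l_min:
        return -1.0                                    # dangerous: punishment
    if l > l_max:
        return 1.0                                     # safe: reward
    return 2.0 * (l - l_min) / (l_max - l_min) - 1.0   # linear f(l) in between

def external_reward(own_action, team_action):
    """External signal r_out of Eq. (12): reward agreement with the elected team action."""
    return 1.0 if own_action == team_action else -1.0

def total_reward(l, own_action, team_action, alpha=0.5):
    """Weighted sum of Eq. (13), with beta = 1 - alpha so that alpha + beta = 1."""
    beta = 1.0 - alpha
    return alpha * internal_reward(l) + beta * external_reward(own_action, team_action)
```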

The Q function is defined as:

$Q(s_t, a_t) = r_t + \gamma \max_a Q(s_{t+1}, a)$    (14)

The update rules are:

$\Delta Q_t(s_t, a_t) = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)$    (15)

$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t\, \Delta Q_t(s_t, a_t)$    (16)

At the beginning of reinforcement learning, the main task is to explore the environment, so the randomness of the action selection should be larger; in the later stage of reinforcement learning, the learning should converge, so the randomness of the action selection should be smaller. A Boltzmann machine is used here to carry out the annealing operation. The probability that action $a_i$ is chosen is:

$P(a_i) = \frac{e^{Q(s, a_i)/T}}{\sum_{a_n \in A} e^{Q(s, a_n)/T}}$    (17)

$T = T_0 / \beta^{\,t}$

$T_0$: the initial temperature value,
t: time,
T: the current temperature value, obtained from the decay of $T_0$ with time,
β: a constant used to control the rate of annealing.

The ε-greedy strategy is used to select action $a_i$: in each action selection, with probability ε an action is selected at random, and with probability 1 − ε the action with the largest Q value is selected [7]. A code sketch of this annealed action selection is given below.
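The annealed Boltzmann selection of Eq. (17) and the ε-greedy rule described above can be sketched as follows. The temperature schedule T = T_0 / β^t follows the reconstruction above, and the numeric defaults (T0, beta, eps) plus the small temperature floor are illustrative assumptions only.

```python
import math
import random

def boltzmann_select(Q, s, actions, t, T0=5.0, beta=1.05, T_floor=0.05):
    """Boltzmann (softmax) action selection of Eq. (17) with annealed temperature."""
    T = max(T0 / (beta ** t), T_floor)          # temperature decays with time
    qs = [Q.get((s, a), 0.0) for a in actions]
    m = max(qs)                                  # subtract max for numerical stability
    weights = [math.exp((q - m) / T) for q in qs]
    pick, acc = random.random() * sum(weights), 0.0
    for a, w in zip(actions, weights):
        acc += w
        if pick <= acc:
            return a
    return actions[-1]

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Explore with probability eps, otherwise exploit the largest Q value."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```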

Figure 3 shows how the formation moves after learning.

Fig. 3. The robots (a) maintain the formation in a spacious environment, (b) get into a line formation when encountering obstacles, and (c) get back to the original formation after moving through the obstacles

In the experiment, the leader plans out a path from the start point to the end point. The follower changes its angle with respect to the leader in accordance with the changes of the environment and gradually adapts to the new environment. The robots maintain the original formation in a spacious environment. As they move closer to the obstacle, they get into a line formation to pass through the obstacles. When the environment becomes spacious again, the robots gradually adjust to form a formation like the original one in the first stage.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 60873043) and the Funding Program for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality of China (No. PHR20008006).

References

1. Shao, J.Y., Xie, G.M., Yu, J.Z., Wang, L.: Leader-following Formation Control of Multiple Mobile Robots. In: Proceedings of the 2005 IEEE International Symposium on Intelligent Control, Limassol, June 27-29, pp. 808-813 (2005)
2. Desai, J.P., Ostrowski, J.P., Kumar, V.: Modeling and Control of Formations of Nonholonomic Mobile Robots. IEEE Transactions on Robotics and Automation 17(6), 905-908 (2001)
3. Ruan, X.G., Cai, J.X., Chen, J.: Learning to Control a Two-Wheeled Self-Balance Robot Using Reinforcement Learning Rules and a Fuzzy Neural Network. In: Fourth International Conference on Natural Computation, pp. 395-398 (2008)
4. Ishikawa, M., Hagiwara, T., Yamamoto, N., Kiriake, F.: Brain-Inspired Emergence of Behaviors in Mobile Robots by Reinforcement Learning with Internal Rewards. In: Eighth International Conference on Hybrid Intelligent Systems, pp. 38-43 (2008)
5. Dung, L.T., Komeda, T., Takagi, M.: Reinforcement Learning in Non-Markovian Environments Using Automatic Discovery of Subgoals. In: SICE Annual Conference, pp. 2600-2605 (2007)
6. Huang, B.Q.: Reinforcement Learning Method and Applied Research. PhD thesis, Shanghai Jiaotong University, 22-23 (2007)
7. Gao, Y., Chen, S.F., Lu, X.: Research on Reinforcement Learning. Acta Automatica Sinica 30(1), 86-100 (2004)