Proactive MDP-based Collision Avoidance Algorithm for Autonomous Cars


Denis Osipychev, Duy Tran, Weihua Sheng — School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, Oklahoma
Girish Chowdhary — School of Mechanical and Aerospace Engineering, Oklahoma State University, Stillwater, Oklahoma
Ruili Zeng — Department of Automobile Engineering, Military Transportation University, Tianjin, China

Abstract — This paper considers the decision-making problem of an autonomous car driving through an intersection in the presence of human-driven cars. A proactive collision avoidance system based on a learning-based MDP model is proposed, in contrast to a reactive system. This approach allows us to pose the question as an optimization problem. The proposed learning algorithm explicitly describes the interaction with the environment through a probabilistic transition model. The effectiveness of this concept is supported by a variety of simulations covering driving behaviors with Gaussian-distributed velocity, random actions, and real human driving.

I. INTRODUCTION

The high risk of collisions and the severity of their possible consequences remain defining properties of land transportation. Driving in the presence of other road users is a complex task mastered only by human drivers, and even they make wrong decisions, leading to lamentable statistics. Safe and reliable decision making is a major challenge for the use and popularization of autonomous robotic vehicles. To fit into existing traffic, modern autonomous cars are expected to combine a fast reactive safety system with a proactive, predictive control algorithm [1], [2]. Reactive safety features warn the driver about difficulties on the road or even take urgent actions to avoid accidents. They were developed to surpass humans in reaction time or sensing quality. Thanks to modern detectors and fast computer logic, such systems have had many successful implementations and prevented up to 80% of simulated collisions [3], [4]. For example, the completely reactive robotic system ALVINN uses camera images and neural networks for reactive decision making [5]. Reactive safety systems have improved road safety by helping avoid collisions and accidents in the short term. However, further safety improvements require increasing the sensitivity of the reactive systems, which leads to an increase in the number of false alarms. Also, most of those systems were non-optimal and annoying to the passengers. Proactive safety achieves a higher sensitivity to potentially dangerous situations while taking softer actions. Despite the use of both proactive and reactive methods in mobile robotics research, their adoption in transportation vehicles remains a challenge. There are existing works that recognize a driver's activities and act according to the likelihood of those or other activities.

This project is supported by the National Science Foundation NSF Grants CISE/IIS , CISE/IIS/ , National Natural Science Foundation of China NSFC Grants , and the Open Research Project of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China, No. ICT1408.

Fig. 1. Collision avoidance with the use of an intersection's infrastructure is possible in the time domain by changing speed in advance.
Most of these works consider the world as partially observed or completely hidden, where the motivation and dynamics of the processes are not available and only the effects of certain actions can be observed [6], [7]. These works give an approximate solution or use heuristic approaches such as if-else rules preprogrammed by the developer, which does not allow the solution to be optimized. This paper proposes the use of a classical Markov Decision Process (MDP) to solve the problem. In this way, it allows us to find the best actions given full knowledge of the speed, direction and position of all involved vehicles. This condition can be satisfied by establishing RF connections between all cars and transferring the data to each other using V2V or V2I communication, as explained in [8]. Because up to 50% of accidents occur at intersections, this paper introduces and verifies the use of the MDP framework for planning the actions of an autonomous vehicle (the agent) and checks the sufficiency of proactive actions for avoiding collisions. Fig. 1 illustrates that an early, small change in speed is sufficient to avoid a collision in the time domain.

II. METHODOLOGY

A. Learning-based MDP model: optimization over expected reward

In this section, we formulate the proactive decision-making problem as an optimization problem. For this purpose, the autonomous collision avoidance task is posed as an MDP tuple (S, A, T, R) that captures the Markovian transition of the car in the real world [9], [10]. Here, S is the set of discrete states of the car, A is the set of desired actions, T(s, a, s') is the transition model from any state s ∈ S to any other state s' ∈ S when the action a ∈ A is taken, and denotes the conditional probability of transition p(s' | s, a). R is the model of the reward obtained by the transition (s, a, s'). The value of each state is given by the value of the next state discounted by the discount factor γ plus the cost of the transition, as described by the Bellman equation:

V(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s') · [ R(s, a, s') + γ·V(s') ]   (1)

The optimal policy π* is the set of actions, one for each state, that maximizes the expected discounted reward:

π* = argmax_π E_π[ Σ_{s∈S} R(s, a, s') ]   (2)

There are many approaches to solving MDPs, some of which were surveyed in recent papers [10], [7]. We chose the value-iteration algorithm due to its convergence guarantees. The proposed method is decomposed into the following steps: creating a dynamical model of a car; learning transition rules for the list of actions over dynamical simulations; solving the MDP in order to find the optimal solution for passing the intersection; and building a dynamical simulation of an intersection to validate the method.

B. Dynamic model of a vehicle

In order to simulate the dynamics of a car, a simplified Dubins-car-style dynamical model is described by equations of motion based on the dynamic vehicle model [11]. It uses six parameters to describe the real vehicle and environment:

m : Mass of vehicle [kg]
a : Distance from front axle to Center of Gravity [m]
b : Distance from rear axle to Center of Gravity [m]
C_x : Longitudinal tire stiffness [N]
C_y : Lateral tire stiffness [N/rad]
C_A : Air resistance coefficient [1/m]

In this simulation, we chose coefficients according to the Volvo V70 model as follows: m = 1700, a = 1.5, b = 1.5, C_x = , C_y = 4000, C_A = 0.5. Three states of the model were taken into consideration:

x_1(t) = v_x(t) : longitudinal velocity [m/s]   (3)
x_2(t) = v_y(t) : lateral velocity [m/s]   (4)
x_3(t) = r(t) : yaw rate [rad/s]   (5)

where v_x(t) and v_y(t) represent the longitudinal and lateral velocity and r(t) is the yaw rate at time t. The state-space structure of the model is given by the following differential equations:

dx_1(t)/dt = x_2(t)·x_3(t) + (1/m)·[ C_x·(u_1(t)+u_2(t))·cos u_5(t) − 2·C_y·(u_5(t) − (x_2(t)+a·x_3(t))/x_1(t))·sin u_5(t) + C_x·(u_3(t)+u_4(t)) − C_A·x_1(t)^2 ]   (6)

dx_2(t)/dt = −x_1(t)·x_3(t) + (1/m)·[ C_x·(u_1(t)+u_2(t))·sin u_5(t) + 2·C_y·(u_5(t) − (x_2(t)+a·x_3(t))/x_1(t))·cos u_5(t) + 2·C_y·(b·x_3(t) − x_2(t))/x_1(t) ]   (7)

dx_3(t)/dt = 1/(0.5·(a+b)^2·m) · { a·[ C_x·(u_1(t)+u_2(t))·sin u_5(t) + 2·C_y·(u_5(t) − (x_2(t)+a·x_3(t))/x_1(t))·cos u_5(t) ] − 2·b·C_y·(b·x_3(t) − x_2(t))/x_1(t) }   (8)

Solving these ordinary differential equations (ODEs), Eqs. (6)–(8), explicitly is difficult. However, the Runge-Kutta method [12] provides a numerical solution for the state of the vehicle (velocity, acceleration and yaw rate) in every iteration.

Fig. 2. An example of MDP formulation showing that some actions lead to the collision state. These actions should be marked by a highly negative reward (penalty).

To utilize the discrete-state MDP framework described in Section II-A, the continuous-time dynamic model of the car has to be translated into a discrete-state transition model.
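Before turning to that discretization, the following is a minimal sketch of the single-track model of Eqs. (6)–(8) integrated with a classical Runge-Kutta step. The inputs u_1..u_4 are treated as tire traction inputs and u_5 as the steering angle, and the value of C_x (which is not readable in the source) is a placeholder assumption; this is an illustrative reconstruction, not the authors' simulation code.

```python
import numpy as np

# Parameters from the paper (Volvo V70-like values); the C_x value below is an
# assumed placeholder, since the original number is missing in the source text.
m, a, b = 1700.0, 1.5, 1.5      # mass [kg], axle-to-CoG distances [m]
C_x = 1.5e5                      # longitudinal tire stiffness [N] (assumed)
C_y = 4000.0                     # lateral tire stiffness [N/rad]
C_A = 0.5                        # air resistance coefficient [1/m]

def f(x, u):
    """Right-hand side of Eqs. (6)-(8): x = [v_x, v_y, yaw rate], u = traction inputs and steering."""
    x1, x2, x3 = x
    u1, u2, u3, u4, u5 = u
    slip = u5 - (x2 + a * x3) / x1                       # front-axle slip term
    dx1 = x2 * x3 + (1.0 / m) * (C_x * (u1 + u2) * np.cos(u5)
                                 - 2 * C_y * slip * np.sin(u5)
                                 + C_x * (u3 + u4) - C_A * x1 ** 2)
    dx2 = -x1 * x3 + (1.0 / m) * (C_x * (u1 + u2) * np.sin(u5)
                                  + 2 * C_y * slip * np.cos(u5)
                                  + 2 * C_y * (b * x3 - x2) / x1)
    dx3 = (a * (C_x * (u1 + u2) * np.sin(u5) + 2 * C_y * slip * np.cos(u5))
           - 2 * b * C_y * (b * x3 - x2) / x1) / (0.5 * (a + b) ** 2 * m)
    return np.array([dx1, dx2, dx3])

def rk4_step(x, u, dt=0.1):
    """One classical Runge-Kutta (RK4) step, as used to integrate the ODEs numerically."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: coast at 14 m/s (~30 mph) with a small steering input for one CAS step (1 s).
x = np.array([14.0, 0.0, 0.0])
for _ in range(10):                                      # 10 increments of 0.1 s
    x = rk4_step(x, u=(0.0, 0.0, 0.0, 0.0, 0.02))
print(x)
```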
An example of this translation is shown in Fig. 2. The collision state is defined as a state in the grid world which is occupied by two cars at the same time.

As can be inferred from the example, the only way to avoid a collision at the junction of two paths is to reach this point at a time different from the other vehicle. This approach enables time to be used as one of the states of the car and allows dynamic states to be separated into static states by time steps. To maintain the connection between states, a transition model is required. The uncertainty in the transitions s → s' shown in Fig. 3 has to be described in terms of transition probabilities p(s' | s, a). The distribution of these probabilities has to be found for each of the discrete actions performed by the agent.

TABLE I. ACTION DESCRIPTIONS AND PENALTIES

N  | Description of action | Penalty
1  | Keep going            | 0
2  | Soft speed up         | 0
3  | Soft slow down        | 0
4  | Soft turn left        | 0
5  | Soft turn right       | 0
6  | Emergency stop        |
7  | Speed up              |
8  | Slow down             |
9  | Turn left             |
10 | Turn right            | -30

The set of actions can be decomposed into two main subsets: so-called soft actions and firm actions. The soft actions are numbered 1 to 5 in Table I. Because of their smoothness and passenger-friendliness, they are grouped as the preferred actions and defined as zero-cost actions. The firm actions, numbered 6 to 10 in Table I, are rough actions used when the soft actions are not sufficient to prevent a collision, with costs defined according to their preference. The durations of all actions are identical and defined by the time step of the CAS algorithm, equal to 1 second.

C. Learning of a discrete transition and reward model

In this paper, to represent a dynamical state of the agent as a static state, we choose 4 parameters: time, longitudinal and lateral locations on the road, and velocity of the vehicle. These parameters form a 4-dimensional set of non-overlapping states, while other parameters such as acceleration and orientation of the vehicle are neglected to reduce the number of states. These ignored parameters are assumed to be relatively small and to return to their initial values within a very short duration. Any state of the autonomous car can be classified by this discrete model of the world and represented as a tuple:

s = [time, loc_x, loc_y, velocity]

The resulting state-action transition matrix T(s, s', a) is very large and grows with the number of states. For the case considered in this paper, the set of all states forms a 10 × 10 × 3 × 10 grid, i.e., 3000 initial states and the same number of possible resulting states for each of the 10 actions. This leads to a very high-dimensional MDP with 90 million elements (3000 × 3000 × 10). It should be noted that the dimensionality of the discretized state space can be reduced by increasing the range over which the states are discretized, but this leads to other complexities, such as high uncertainty in the transitions.

To learn the transition model, this paper proposes the learning Algorithm 1, in which one time step of the CAS is divided into 10 incremental steps of 0.1 second each. The Dynamic Simulation function described in Section II-B simulates the path with these steps and returns the [x, y] data of all 10 steps. These coordinates are applied linearly to all possible initial points [Loc_x, Loc_y ∈ Road] distributed equally inside one discrete location state, giving the expected paths from these points. The obtained paths are then classified into discrete states. The numbers of visits to these discrete states when taking one action give the conditional probability distribution of the vehicle within one time step of the CAS.
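The state tuple s = [time, loc_x, loc_y, velocity] implies a mapping from continuous car states onto the 10 × 10 × 3 × 10 grid. The sketch below shows one way such a mapping could look; the assignment of grid sizes to dimensions, the cell sizes, and the velocity range are assumptions for illustration, since the paper only states the product of the dimensions.

```python
import numpy as np

# Grid sizes per state dimension (10 x 10 x 3 x 10 = 3000 states).  The assignment
# below (10 time steps, 10 longitudinal cells, 3 lateral cells, 10 velocity bins)
# and the physical cell sizes are assumptions, not values given in the paper.
N_T, N_X, N_Y, N_V = 10, 10, 3, 10
CELL_X, CELL_Y = 5.0, 3.5          # assumed cell sizes [m]
DT, V_MAX = 1.0, 30.0              # CAS time step [s], assumed maximum speed [m/s]

def discretize(t, loc_x, loc_y, v):
    """Map a continuous car state onto the discrete tuple [time, loc_x, loc_y, velocity]."""
    i_t = min(int(t / DT), N_T - 1)
    i_x = min(int(loc_x / CELL_X), N_X - 1)
    i_y = min(int(loc_y / CELL_Y), N_Y - 1)
    i_v = min(int(v / V_MAX * N_V), N_V - 1)
    return i_t, i_x, i_y, i_v

def flat_index(s):
    """Flatten the 4-D tuple so T can be stored as a (3000 x 3000 x |A|) array."""
    return np.ravel_multi_index(s, (N_T, N_X, N_Y, N_V))

s = discretize(t=2.3, loc_x=12.0, loc_y=4.0, v=14.0)
print(s, flat_index(s))            # e.g. (2, 2, 1, 4) and its flat index
```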
Learning the transition model in this way requires a lot of computational work, but the T matrix has to be obtained only once and remains valid as long as the dynamic model and the parameters of the grid world do not change.

Data: Car dynamic model D
Result: Transition model T
for every action a ∈ A do
    for every velocity v ∈ R do
        x = 0, y = 0, t = 0
        while t_inc ≤ t_CAS do
            [x_n, y_n, v_n, t_n] = D(x, y, t, t_inc, v)
            t_inc = t_inc + t_CAS / 10
            for every (Loc_x, Loc_y, time) ∈ R do
                s = [Loc_x, Loc_y, v, time]
                s_n = [x_n + Loc_x, y_n + Loc_y, v_n, t_n + t]
                T(s, a, s_n) = n(s_n) / Σ n(S_n)
Algorithm 1: Learning the Transition Model

Fig. 3. Uncertainties in the transitions from one state under one action may result in different states due to stochasticity inside the initial state.
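In the spirit of Algorithm 1, the following sketch estimates T(s, a, s') as normalized visit counts obtained by simulating from many starting points spread over one discrete cell. Here `simulate_step` stands in for one 0.1 s call to the dynamic model D and `discretize` for the grid mapping sketched earlier; both names are illustrative placeholders, not the paper's code.

```python
from collections import defaultdict

def learn_transition_model(actions, velocities, start_offsets, simulate_step, discretize):
    """Estimate T(s, a, s') as normalized visit counts over simulated one-second paths."""
    counts = defaultdict(lambda: defaultdict(int))       # (s, a) -> {s': number of visits}
    for a in actions:
        for v0 in velocities:
            for (dx, dy) in start_offsets:               # starting points spread over one location cell
                x, y, v, t = dx, dy, v0, 0.0
                for _ in range(10):                      # 10 increments of t_CAS / 10 = 0.1 s
                    x, y, v, t = simulate_step(x, y, v, t, a)
                s  = discretize(0.0, dx, dy, v0)         # initial discrete state (time bin 0)
                sn = discretize(t, x, y, v)              # discrete state reached after one CAS step
                counts[(s, a)][sn] += 1
    # Normalize the visit counts into conditional probabilities p(s' | s, a).
    T = {}
    for (s, a), visits in counts.items():
        total = sum(visits.values())
        T[(s, a)] = {sn: n / total for sn, n in visits.items()}
    return T
```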

The reward function is designed to show the agent which states should be preferred. We give a large negative reward to collisions, or to be more precise to the states in which a collision happens. To motivate the agent to move towards the intersection, the states on the other side of the intersection receive a positive reward. This positive reward is reduced by the time at which it is obtained, to discourage a policy that is very slow but safe. All other states obtain a reward according to the cost of actions shown in Table I. This formulation provides a great degree of flexibility in defining the priorities of actions and states.

R(s, s_collision, a) = R_penalty   (9)
R(s, s_final, a) = 50 − s_time   (10)
R(s, s', a) = Cost(a)   (11)

where R_penalty denotes the large negative collision penalty.

D. CAS algorithm description

The decision-making Algorithm 2 for the CAS is based on the Bellman equation shown in Eq. (1). We calculate the vector V(s) of the maximum values of each state s using the T(s, s', a) and R(s, s', a) matrices with respect to the probability of the transition from this state to any resulting state and the cost of this transition. The output matrix P(s) gives the best policy of actions. When the allocation of the penalty states in the matrix R is known, we have a map of actions for any state of the agent, regardless of where it has actually been. This policy is relevant only for the specific location of penalties, i.e., the distribution of the reward in the space. We could say that, regardless of other factors, a policy calculated once should fit any similar distribution of the rewards. Therefore, there is no need to constantly compute the policies on-line; they can be precomputed in advance and stored as ready-made solutions in a database, which saves computation time.

Data: Transition model T, Reward model R
Result: Optimal policy π
while Δ > η do
    for every s ∈ S do
        v = V(s)
        V(s) = max_{a∈A} Σ_{s'} T(s, a, s') · [ R(s, a, s') + γ·V(s') ]
        π(s) = argmax_{a∈A} Σ_{s'} T(s, a, s') · [ R(s, a, s') + γ·V(s') ]
        Δ = max(Δ, |v − V(s)|)
Algorithm 2: Value-iteration algorithm

E. Simulation description

To prove the viability of the concept, a computer simulation has been built describing an intersection where an autonomous vehicle moves from south to north. The simulation environment has been designed in the Matlab computing environment as an intersection where both autonomous and human-driven vehicles are involved. Fig. 4 illustrates the simulation of the vehicles passing the intersection, where the green, blue and yellow rectangles represent the human-driven vehicles, while the red one is the autonomous vehicle.

Fig. 4. Simulation of the autonomous car (at the bottom) coming through the intersection with other human-driving cars (on the right).

Algorithm 3 utilizes the dynamical equations of all vehicles and updates their positions with a time interval of 10 ms. The short update interval is used to eliminate the possibility of skipping discrete states and to avoid one vehicle jumping over another. The frequency of the CAS decision-making algorithm has been set to 1 Hz (once every second). Therefore, after each decision the agent continues to move by inertia for 1 second, until the next action is computed based on the evaluation of the environment. A delay in the implementation of the action is not taken into consideration, since the dynamics of the car can be defined as a black box. Two generalized cases of the problem have been elaborated: movement in the same direction as the traffic and movement in the transverse direction. The states of collisions are determined by classifying the visited states of the human-driven cars under the assumption that they continue to move with fixed velocity.
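Returning to the CAS solver of Section II-D, the sketch below shows a compact dense-array form of the value-iteration step of Algorithm 2. The array shapes, discount factor and stopping threshold η are assumptions for illustration; with the 3000-state, 10-action problem above, a dense T would hold the 90 million elements mentioned earlier, so a sparse representation would be preferable in practice.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eta=1e-4):
    """T[s, a, s'] = p(s' | s, a); R[s, a, s'] = reward of the transition (s, a, s')."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V(s'))   -- Eq. (1)
        Q = np.einsum('sap,sap->sa', T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < eta:                      # Delta < eta: converged
            break
    return V, Q.argmax(axis=1)               # value function and greedy policy pi(s)

# Tiny usage example with a random 4-state, 2-action MDP.
rng = np.random.default_rng(0)
T = rng.random((4, 2, 4)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((4, 2, 4)) - 0.5
print(value_iteration(T, R))
```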
Classifying the predicted states in this way makes it possible to obtain the probability distribution of the intermediate states of all vehicles and to assign penalty values to these states in proportion to their probabilities. Three role models have been created to simulate a human-driven car. The first one reproduces a human driver holding a constant speed: the car is given an initial velocity, and its speed at each step is drawn from a Gaussian probability distribution centered on the velocity of the previous step. The second model emulates a random selection of an action every second from the list of soft actions, unified with the list of the agent's actions; it reproduces the intentional actions of a driver while driving. The third model uses real human driving. For this purpose, data have been obtained by driving through the simulated intersection using a Logitech G27 steering wheel and pedals to control the model of the car. Due to the large computational delay in calculating the transition matrix and the policy, the human driving cannot be executed in real time. The raw data produced by the human driver has been saved to a data file and replayed step by step during the CAS simulation. Thus, when the value-iteration algorithm is calculating the policy, the manually driven vehicle stops until the calculations are finished. This allows us to simulate the interaction with real drivers as closely as possible.
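A minimal sketch of the three driver role models described above is given below. The Gaussian standard deviation, the soft-action names, and the trace format are assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
SOFT_ACTIONS = ["keep", "soft_speed_up", "soft_slow_down", "soft_left", "soft_right"]

def gaussian_speed_driver(v_prev, sigma=0.5):
    """Model 1: next speed drawn from a Gaussian centered on the previous speed (sigma assumed)."""
    return max(0.0, rng.normal(v_prev, sigma))

def random_action_driver():
    """Model 2: one soft action picked at random every second."""
    return rng.choice(SOFT_ACTIONS)

def replayed_human_driver(trace, step):
    """Model 3: replay a recorded human-driving trace (e.g. from the steering wheel), one entry per CAS tick."""
    return trace[min(step, len(trace) - 1)]

# Example usage: evolve the Gaussian-speed driver for three CAS steps.
v = 14.0                                   # ~30 mph
for k in range(3):
    v = gaussian_speed_driver(v)
    print(k, round(v, 2), random_action_driver())
```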

It should be noted that none of these models performs actions in a very aggressive manner aimed at committing an intentional crash.

Data: Transition model T, Dynamic function D
Result: Result of collision
car_n = [x_n, y_n, v_n], t = 0
while y ≤ y_final ∈ R do
    [x_n, y_n, v_n, t_n] = D_n(x_n, y_n, v_n, t_n), n = 0..3
    if t ≥ t_CAS then
        S_collision,n = S_Agent ∩ S_car_n
        S_collision → R
        if R ≠ R_prev then
            π = CAS(x, y, v, t, T, R)
        a_{n=0} = π(s)
        switch Human behavior model do
            case 1: v_{n=1..3} = Gaussian(v_n)
            case 2: a_{n=1..3} = random(a ∈ A)
            case 3: v_{n=1..3} = load human.model
        apply a_{n=0..3} to D_n
    t = t + Δt
Algorithm 3: Simulation algorithm

Fig. 5. Transition model for actions 1 (keep going), 6 (emergency brake), 7 (speed up), 9 (turn left), 10 (turn right) and speeds 1, 30 and 60 mph. The probability of transition from the state marked by * is shown in gradations of red color.

III. EVALUATION

The transition matrix has been obtained by simulating the dynamic function of the agent. Ten small incremental steps within each time state have been checked and classified into discrete states, defining the conditional probability of ending up in any of these states. Thereby, 10 interim states have been tested for each of the 10 actions in each of the 3000 states, resulting in the classification of the values of the dynamic function. This process was the most computationally intensive despite the use of a simplified dynamic model. The resulting states of each action are shown in Fig. 5 in tonal gradations with respect to their probability. As can be seen, this probability depends not only on the selected action, but also on the vehicle's speed and location on the roadway.

The quantitative simulations provided data sufficient to compare the behavior of the reactive and proactive systems over 100 trials of 8 simulations each, covering 2 different initial velocities (30 and 60 miles per hour) and the presence of one or two human-driven cars with Gaussian-distributed speed. This quantitative simulation did not consider the random-action and real-human models due to difficulties in the comparison. In all cases, no car collisions occurred, and the travel time through the intersection improved significantly in contrast to the reactive system. Fig. 6 shows the velocities of the agent (denoted as Car1) and the human-driven car (denoted as Car2) moving in transverse directions.

Fig. 6. Agent (Car1) and human (Car2) velocities in a random example; the simulation stops when the agent passes the intersection. Both the human-driven and the autonomous car had initial velocities of 30 mph (14 mps), shown in the top figure, and mps, shown in the bottom one.

As can be inferred from the figure, the time required to pass the intersection for the proactive algorithm (6.1 and 4.5 seconds) is less than for the reactive

algorithm (7.1 and 9.8 seconds). The actions performed by the proactive system were smoother and required less change in speed, which caused less discomfort to the passengers. The cases with two human-driven cars are shown in Fig. 7. In all simulations the travel time of the proactive system was 25-30% shorter, and the agent avoided a complete stop in most cases when the use of soft actions was sufficient.

Fig. 7. Agent (Car1) and human (Car2, Car3) velocities in a random example; the simulation stops when the agent passes the intersection.

Fig. 8. Maximum acceleration used and travel time comparison for the MDP and reactive methods. The higher variances of the MDP results are due to the variety of solutions.

The statistical data over all 100 trials, shown in Fig. 8, demonstrate the significantly lower maximum acceleration used to avoid collisions and the improvement in travel time. The wider range of travel times and accelerations resulted from the originality of each solution found by the MDP for each particular allocation of the cars.

IV. CONCLUSIONS

Simulations of this approach proved the possibility of long-term planning of actions that avoid collisions with other cars. The CAS algorithm proposed in this paper avoided collisions in all considered cases. Significant advantages in travel time were achieved over the reactive methods that use a full-stop algorithm programmed with if-else rules. Simulations showed that the delay was reduced by 25-50% for the case of cross-traffic. The car performed a full stop only when there was not enough distance to maintain a lower speed while other cars were passing through. However, the calculation of the optimal policy carried out on-line significantly delays the CAS algorithm and cannot be implemented as an on-line process on a real car. The only way to reduce the computation time is to avoid changing the allocation of the penalties. This can be done by predicting the intentions of other drivers: human behavior can be learned and classified into several models which can then be used for the allocation of the penalty states. Another way is based on off-line calculation of all possible allocations of the penalties and combining them into groups with a unified solution that satisfies the whole group. This list of solutions can be used as a ready-made policy and can be considered on-line.

REFERENCES

[1] G. Leen and D. Heffernan, "Expanding automotive electronic systems," Computer, vol. 35, no. 1.
[2] J. Levinson et al., "Towards fully autonomous driving: Systems and algorithms," in Intelligent Vehicles Symposium (IV), 2011 IEEE. IEEE, 2011.
[3] T. Li, S.-J. Chang, and Y.-X. Chen, "Implementation of human-like driving skills by autonomous fuzzy behavior control on an FPGA-based car-like mobile robot," Industrial Electronics, IEEE Transactions on, vol. 50, no. 5.
[4] R. Sukthankar, "Raccoon: A real-time autonomous car chaser operating optimally at night," DTIC Document, Tech. Rep.
[5] D. A. Pomerleau, "ALVINN: An autonomous land vehicle in a neural network," DTIC Document, Tech. Rep.
[6] T. Bandyopadhyay et al., "Intention-aware pedestrian avoidance," in Experimental Robotics.
[7] S. Brechtel, T. Gindele, and R. Dillmann, "Probabilistic MDP-behavior planning for cars," in 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2011.
[8] J. Santa, A. F. Gomez-Skarmeta, and M. Sanchez-Artigas, "Architecture and evaluation of a unified V2V and V2I communication system based on cellular networks," Computer Communications, vol. 31, no. 12.
[9] R. Bellman, "A Markovian decision process," DTIC Document, Tech. Rep.
[10] A. Geramifard et al., "A tutorial on linear function approximators for dynamic programming and reinforcement learning." [Online].
[11] MathWorks, "Modeling a vehicle dynamics system." [Online].
[12] E. Hairer, C. Lubich, and M. Roche, The Numerical Solution of Differential-Algebraic Systems by Runge-Kutta Methods. Springer, 1989.
