GPU Accelerated Markov Decision Processes in Crowd Simulation



1 GPU Accelerated Markov Decision Processes in Crowd Simulation
Sergio Ruiz, Computer Science Department, Tecnológico de Monterrey, CCM, Mexico City, México
Benjamín Hernández, National Center for Computational Sciences, Oak Ridge National Laboratory, Tennessee, USA

2 Contents: Introduction; Optimization Approaches; Problem solving strategy; A simple example; Algorithm description; Results; Conclusions & future work

3 Crowd Simulation: Path Planning; Local Collision Avoidance (LCA)

4 Optimization Approaches
According to Reyes et al. (2009) and Foka and Trahanias (2003), Markov Decision Processes (MDPs) are computationally inefficient: as the state space grows, the problem becomes intractable.
Decomposition makes it possible to solve large MDPs (Sucar 2007, Meuleau et al. 1998, Singh and Cohn 1998), through either state space decomposition or process decomposition.
Mausam and Weld (2004) use concurrency to solve MDPs, generating close-to-optimal solutions by extending the Labeled Real-time Dynamic Programming method.

5 Optimization Approaches
Sucar (2007) proposes a parallel implementation of weakly coupled MDPs.
Jóhansson (2009) presents a dynamic programming framework that implements the Value Iteration algorithm to solve MDPs using CUDA.
Noer (2013) explores the design and implementation of a point-based Value Iteration algorithm for Partially Observable MDPs (POMDPs) with approximate solutions. The GPU implementation supports belief state pruning, which avoids unnecessary calculations.

6 Problem Solving Strategy
We propose a parallel Value Iteration algorithm for solving MDPs that interactively guides groups of agents toward assigned goals while avoiding obstacles. For optimal performance, the algorithm runs over a hexagonal grid in the context of a Fully Observable MDP.

7 Problem Solving Strategy
A Markov Decision Process is a tuple $M = \langle S, A, T, R \rangle$:
S is a finite set of states. In our case, 2D cells.
A is a finite set of actions. In our case, 6 directions.
T is a transition model $T(s, a, s')$.
R is a reward function $R(s)$.
A policy $\pi$ is a solution that specifies the action for an agent at a given state. $\pi^*$ is the optimal policy.
[Figure: Transition]
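As a rough illustration of the state and action sets described above, the sketch below enumerates the six neighbors of a hexagonal cell. The axial (col, row) coordinate convention, the order of the directions, and the `hex_neighbors` helper are assumptions made for illustration; the slides do not say how cells are indexed.

```python
# Six axial-coordinate offsets, one per action (assumed ordering).
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def hex_neighbors(col, row, n_cols, n_rows):
    """Return the in-bounds neighbors of cell (col, row) on an n_cols x n_rows grid."""
    neighbors = []
    for d_col, d_row in HEX_DIRECTIONS:
        c, r = col + d_col, row + d_row
        if 0 <= c < n_cols and 0 <= r < n_rows:
            neighbors.append((c, r))
    return neighbors


interior = hex_neighbors(1, 1, 3, 3)  # a cell with all six neighbors in bounds
corner = hex_neighbors(0, 0, 3, 3)    # a corner cell loses out-of-bounds neighbors
```

Moves that would leave the grid are simply dropped here; a solver could instead assign them an out-of-bounds value.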

8 Problem Solving Strategy: Value Iteration
$$\pi_t(s) = \arg\max_a Q_t(s, a)$$
$$Q_t(s, a) = R(s, a) + \gamma \sum_{j=0}^{5} T^a_{sj} V_{t-1}(j) \quad \text{(sum over states } j\text{)}$$
$$V_t(s) = Q_t(s, \pi(s)); \qquad V_0(s) = 0$$
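The Value Iteration update can be sketched sequentially as follows. This is a minimal CPU illustration, not the GPU solver presented in the slides: the three-state chain, its rewards, the value of γ, and the names `value_iteration`, `rewards` and `transitions` are all hypothetical.

```python
def value_iteration(rewards, transitions, gamma=0.9, tol=1e-9, max_iter=1000):
    """rewards[s][a] is R(s, a); transitions[s][a] is a list of (next_state, probability)."""
    n_states = len(rewards)
    values = [0.0] * n_states  # V_0(s) = 0
    policy = [0] * n_states
    for _ in range(max_iter):
        new_values = []
        for s in range(n_states):
            # Q_t(s, a) = R(s, a) + gamma * sum_j T(s, a, j) * V_{t-1}(j)
            q = [rewards[s][a] + gamma * sum(p * values[j] for j, p in transitions[s][a])
                 for a in range(len(rewards[s]))]
            best = max(range(len(q)), key=q.__getitem__)  # argmax over actions
            policy[s] = best
            new_values.append(q[best])
        delta = max(abs(x - y) for x, y in zip(new_values, values))
        values = new_values
        if delta < tol:
            break
    return values, policy


# Hypothetical chain 0 - 1 - 2 with actions 0 = left, 1 = right; state 2 is absorbing.
rewards = [[-1.0, -1.0], [-1.0, 10.0], [0.0, 0.0]]
transitions = [
    [[(0, 1.0)], [(1, 1.0)]],
    [[(0, 1.0)], [(2, 1.0)]],
    [[(2, 1.0)], [(2, 1.0)]],
]
values, policy = value_iteration(rewards, transitions)
```

In this toy problem both non-goal states end up choosing the action that moves right, toward the rewarding transition.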

9 Problem Solving Strategy
We propose to temporarily override the optimal policy when the agent density in a cell is above a certain threshold $\sigma$.
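The slides do not say how the overriding action is chosen. One possible reading, sketched below purely as an assumption, is to fall back to the best-ranked action whose destination cell is not crowded. The `choose_action` helper, its arguments and the fallback rule are hypothetical.

```python
def choose_action(q_values, destination_density, sigma):
    """Pick the highest-Q action whose destination cell has density <= sigma.

    q_values[a] is the Q-value of action a; destination_density[a] is the
    number of agents in the cell that action a leads to (assumed inputs).
    """
    ranked = sorted(range(len(q_values)), key=lambda a: q_values[a], reverse=True)
    for action in ranked:
        if destination_density[action] <= sigma:
            return action
    return ranked[0]  # every destination is crowded: keep the optimal action


free = choose_action([1.0, 5.0, 3.0], [0, 1, 1], sigma=2)     # optimal action is usable
crowded = choose_action([1.0, 5.0, 3.0], [0, 4, 1], sigma=2)  # optimal destination is too dense
```

Because the Q-values are left untouched, the optimal policy applies again as soon as the density drops back under the threshold.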

10 A simplified example
Grid rewards (columns 1 to 4, rows a to c):
a: -3, -3, -3, +100
b: -3, -3, -100 (only three values are listed for this row)
c: -3, -3, -3, -3
A = { N, W, E }, γ = 1 (for simplicity)
Transitions: p = 0.8 (probability of taking the current action), q = 0.1 (probability of taking another action)
$$\pi_t(s) = \arg\max_a Q_t(s, a), \qquad Q_t(s, a) = R(s, a) + \gamma \sum_{j=0}^{2} T^a_{sj} V_{t-1}(j)$$
What is π for cell a3?
$$\pi(a3) = \arg\max_a \{ Q(a3, W),\ Q(a3, N),\ Q(a3, E) \}$$
Q(a3, E) = ( 0.8(100) + 0.1(-3) + 0.1(0) )
Q(a3, W) = ( 0.1(100) + 0.8(-3) + 0.1(0) )
Q(a3, N) = ( 0.1(100) + 0.1(-3) + 0.8(0) )
=> the maximum is Q(a3, E), so π(a3) = E.
The slide labels the parts of each expression as $R(s, a)$, $\gamma$ and $\sum_{j=0}^{2} T^a_{sj} V(j)$.
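The three weighted sums for cell a3 can be checked with a few lines of Python. This sketch evaluates only the bracketed terms shown in the example; the dictionary layout and the `weighted_sum` helper are illustrative assumptions, and the neighbor values (100 to the east, -3 to the west, 0 for the out-of-bounds north) are read off the expressions in the example.

```python
P_INTENDED = 0.8  # p: probability of taking the chosen action
P_OTHER = 0.1     # q: probability of taking each other action

# Value seen in each direction from cell a3, as used in the example.
neighbor_value = {"E": 100.0, "W": -3.0, "N": 0.0}


def weighted_sum(action):
    """Sum over directions of T(action -> direction) * value(direction)."""
    return sum(
        (P_INTENDED if direction == action else P_OTHER) * value
        for direction, value in neighbor_value.items()
    )


sums = {action: weighted_sum(action) for action in neighbor_value}
best = max(sums, key=sums.get)  # action with the largest weighted sum
```

The sums come out to roughly 79.7 for E, 7.6 for W and 9.7 for N, which agrees with E being the maximum.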

11 Algorithm
The steps build the terms of $Q(s, a) = R(s, a) + \gamma \sum_{j=0}^{2} T^a_{sj} V(j)$, illustrated with the Q-values of cell a3.
Data collect: the current cell needs to know the rewards of its neighboring cells and the out-of-bounds values.
Input generation: build $T^a_{sj}$ and $R(s, a) = RW$.
Value Iteration: the optimal policy is computed using parallel transformations and parallel reduction by key.

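12 Algorithm: input generation
Transition matrix requirements:
$$T_P = \begin{bmatrix} p & \cdots & p \\ \vdots & \ddots & \vdots \\ p & \cdots & p \end{bmatrix}, \qquad T_Q^{r,c} = \begin{bmatrix} q_i & \cdots & q_i \\ \vdots & \ddots & \vdots \\ q_i & \cdots & q_i \end{bmatrix}$$
$D_A$ and $D_B$ are matrices of the same dimensions (their entries are not preserved in this transcript).
Dimensions: $|A| \times |A|$, i.e. each cell can compute neighboring info.
$r \in [1, \text{MDP rows}]$, $c \in [1, \text{MDP columns}]$
$q_i$ is derived from $q$ and $RE_i$ (source text: "q i = q RE i 1").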

13 Algorithm: input generation
$$T^{r,c} = T_P \circ D_A + T_Q^{r,c} \circ D_B = \begin{bmatrix} p & q & q \\ q & p & q \\ q & q & p \end{bmatrix}$$
Running example with the reward and discount written out:
Q(a3, E) = 100 + 1.0 ( 0.8(100) + 0.1(-3) + 0.1(0) )
Q(a3, W) = -3 + 1.0 ( 0.1(100) + 0.8(-3) + 0.1(0) )
Q(a3, N): weighted sum ( 0.1(100) + 0.1(-3) + 0.8(0) )
Transition matrix $T^a_{sj}$ computation (each block represents a cell):
$$T^a_{sj} = \begin{bmatrix} T^{1,1} & \cdots & T^{1,\text{MDP columns}} \\ \vdots & \ddots & \vdots \\ T^{\text{MDP rows},1} & \cdots & T^{\text{MDP rows},\text{MDP columns}} \end{bmatrix}$$
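A per-cell transition matrix with p on the diagonal and q elsewhere can be composed as sketched below. The sketch assumes that $D_A$ is the identity mask, that $D_B$ is its complement, and that the products are element-wise, since that combination reproduces the p/q matrix shown for slide 13; it also uses a single q instead of the per-cell $q_i$. The `build_cell_transition` helper is hypothetical.

```python
def build_cell_transition(p, q, n_actions):
    """Return T = T_P * D_A + T_Q * D_B using element-wise products."""
    size = range(n_actions)
    t_p = [[p for _ in size] for _ in size]  # matrix filled with p
    t_q = [[q for _ in size] for _ in size]  # matrix filled with q
    d_a = [[1 if i == j else 0 for j in size] for i in size]  # diagonal mask (assumed)
    d_b = [[1 - d_a[i][j] for j in size] for i in size]       # off-diagonal mask (assumed)
    return [
        [t_p[i][j] * d_a[i][j] + t_q[i][j] * d_b[i][j] for j in size]
        for i in size
    ]


T = build_cell_transition(0.8, 0.1, 3)  # p and q from the simplified example
```

Each row of `T` is the outcome distribution for one intended action, so with p = 0.8 and q = 0.1 every row sums to 1.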

14 Algorithm: Parallel Value Iteration
1. Computation of Q-values.
$$\pi_t = RW + \gamma\, T^a_{sj} V$$
Consecutive parallel transformations (mult, mult, sum) result in a matrix Q that stores, for each cell, an $|A|$-tuple of policy values, one for each action.

15 Algorithm: Parallel Value Iteration
2. Selection of best Q-values. Parallel reduction: in every consecutive $|A|$-tuple of $\pi_t$, the index of the largest value indicates the current best policy.
3. Check for convergence: the policy has converged if $\pi_t - \pi_{t-1} = [0, \ldots, 0]$.
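The three steps can be emulated sequentially on flat arrays, as in the sketch below. It only mirrors the data flow (two multiplications and a sum per Q-value, then a reduction keyed by cell index); it is not the Thrust/CUDA implementation, and the function names, the flat layout, the zero `rw` vector and the second cell's values are assumptions for illustration.

```python
from itertools import groupby


def q_matrix(rw, t_rows, neighbor_values, gamma, n_actions):
    """One Q-value per (cell, action): rw + gamma * sum(T * V)."""
    q = []
    for k, row in enumerate(t_rows):
        cell = k // n_actions
        products = [t * v for t, v in zip(row, neighbor_values[cell])]  # mult
        scaled = [gamma * x for x in products]                          # mult
        q.append(rw[k] + sum(scaled))                                   # sum
    return q


def reduce_by_cell(q, n_actions):
    """For every consecutive n_actions-tuple, keep the index of the largest value."""
    keyed = [(k // n_actions, k % n_actions, value) for k, value in enumerate(q)]
    policy = []
    for _, group in groupby(keyed, key=lambda item: item[0]):
        policy.append(max(group, key=lambda item: item[2])[1])
    return policy


def converged(previous, current):
    """True when the element-wise difference of the two policies is all zeros."""
    return all(a - b == 0 for a, b in zip(previous, current))


cell_t = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]  # actions N, W, E
t_rows = cell_t + cell_t                                      # two cells
neighbor_values = [[0.0, -3.0, 100.0], [5.0, 1.0, -100.0]]    # cell a3, then a made-up cell
rw = [0.0] * 6                                                # reward term left at zero here
q = q_matrix(rw, t_rows, neighbor_values, gamma=1.0, n_actions=3)
policy = reduce_by_cell(q, n_actions=3)
```

For the first cell the largest value sits at index 2 (E), matching the simplified example; for the made-up second cell it sits at index 0 (N).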

16 Crowd Navigation Video


17 Results: test scenarios
Scenarios: Office (1,584 cells), Maze (100x100 cells), Champ de Mars (100x100 cells).
Implementation: CUDA Thrust, with OpenMP and CUDA backends.
CPU: Intel Core i7 running at 3.40 GHz.
ARM (Jetson TK1): 32-bit quad-core ARM Cortex-A15 CPU running at 2.32 GHz.
GPUs: Tegra K1 (192 CUDA cores), Tesla K40c (2880 CUDA cores), GeForce GTX TITAN (2688 CUDA cores).

18 Results: GPU performance

19 Results: GPU speedup
Intel CPU baseline: 8 threads
ARM CPU baseline: 4 threads

20 Conclusion
Parallelization of the proposed algorithm was made possible by formulating it in terms of matrix operations, leveraging the massive data parallelism in GPU computing to reduce the MDP solution time.
We demonstrated that standard parallel transformation and reduction operations provide the means to solve MDPs via Value Iteration with optimal performance.

21 Conclusion
Taking advantage of the proposed hexagonal grid partitioning method, our implementation provides a good level of space discretization and performance.
We obtained a 90x speedup using GPUs, enabling us to simulate crowd behavior interactively.
We found the Jetson TK1 GPU to have remarkable performance, opening many possibilities to incorporate real-time MDP solvers in mobile robotics.

22 Future Work
Reinforcement learning.
Evaluate different parameter values to obtain policy convergence in the least number of iterations without losing precision in the generated paths.
Couple the MDP solver with a Local Collision Avoidance method to obtain more precise simulation results at the microscopic level.
Investigate further applications of our MDP solver beyond the context of crowd simulation.

23 GPU Accelerated Markov Decision Process in Crowd Simulation
Further reading: Ruiz, S., Hernandez, B. A parallel solver for Markov Decision Process in Crowd Simulation. MICAI 2015, 14th Mexican International Conference on Artificial Intelligence, Cuernavaca, Mexico, IEEE volume: ISBN
Sergio Ruiz, Computer Science Department, Tecnológico de Monterrey, CCM, Mexico City, México, sergio.ruiz.loza@itesm.mx
Benjamín Hernández, National Center for Computational Sciences, Oak Ridge National Laboratory, Tennessee, USA, hernandezarb@ornl.gov
Thank you!
This research was partially supported by: CONACyT SNI-54067, CONACyT PhD scholarship , Nvidia Hardware Grant and Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, under DOE Contract No. DE-AC05-00OR22725.

24 Additional Results: Intel CPU

25 Additional Results: ARM CPU
