Fast SSP Solvers Using Short-Sighted Labeling

Size: px

Start display at page:

Download "Fast SSP Solvers Using Short-Sighted Labeling"

Kristin Bruce
5 years ago
Views:

1 Luis Pineda, Kyle H. Wray and Shlomo Zilberstein College of Information and Computer Sciences, University of Massachusetts, Amherst, USA July 9th

2 Introduction Motivation SSPs are a highly-expressive model for sequential decision making They can be used for decision-making in the presence of multiple goals Caveat: solving SSPs optimally is a very computationally intensive task

3 Introduction Motivation SSPs are a highly-expressive model for sequential decision making They can be used for decision-making in the presence of multiple goals Caveat: solving SSPs optimally is a very computationally intensive task

4 Introduction Motivation A range of model reduction and heuristic search techniques for solving SSPs are available But even restricting only to states in optimal policies can be prohibitive

5 Introduction Motivation Recent approaches attempt to reduce the reachable state space even more and use re-planning during execution However, they still have several drawbacks: Restricted to particular problem representations (e.g., RFF and FF-Replan) Involve pre-processing (e.g., M k l reduction) Result in moderate reductions in time (e.g., SSiPP)

6 Introduction In this work... We introduce a new algorithm called FLARES (Fast Labeling from Residuals Using Samples) Modifies LRTDP to find high-performing policies much faster It can be extended to also find optimal policies

7 Background Stochastic Shortest Path Problems An SSP is a tuple xs, A, T, C, s 0, s g y, where: S is a finite set of states A is a finite set of actions T ps 1 s, aq P r0, 1s is a transition function Cps, aq P p0, 8q is a cost function s 0 is an initial state s g is a goal state

8 Background Solutions to SSPs The optimal value function for an SSP can be found using: BUpsq : min!cps, aq ` ÿ ) T ps 1 s, aqv ps 1 q apa πpsq arg min apa s 1 PS!Cps, aq ` ÿ s 1 PS ) T ps 1 s, aqv ps 1 q (1) (2)

9 Motivation for FLARES Problems with optimal heuristic search algorithms S G

10 Motivation for FLARES Depth-limited search S G

11 Motivation for FLARES Depth-limited search S x G

12 Motivation for FLARES Depth-limited search Large horizon needed to propagate information from states close to the goal S x G

13 Motivation for FLARES RTDP with short-sighted SSPs S G

14 Motivation for FLARES RTDP with short-sighted SSPs S G

15 Motivation for FLARES RTDP with short-sighted SSPs This approch doesn't seem to work well in practice, typically resulting in worse anytime performance than plain RTDP S G

16 FLARES Labeling states (LRTDP) S If states reachable by following the greedy policy have low residual, then we can label all of them as solved G

17 FLARES Labeling states S If states reachable by following the greedy policy have low residual, then we can label all of them as solved G

18 FLARES Labeling states S Computation can be saved by using cached values and actions for labeled states during the search G

19 FLARES But......to label s, all states reachable from s must be checked...

20 FLARES Labeling states can also be costly S G

21 FLARES FLARES Proposed approach: Move the short-sightedness to the labeling

22 FLARES FLARES Reachable states S Low residual states up to horizon 2t G Labeled states up to horizon t

23 FLARES FLARES S G Similar to the RTDP approach mentioned before... except that now computation can be reused and... there is a crisp termination condition that exploits short-sightedness

24 FLARES FLARES S G Similar to the RTDP approach mentioned before... except that now computation can be reused and... there is a crisp termination condition that exploits short-sightedness

25 FLARES FLARES S G Similar to the RTDP approach mentioned before... except that now computation can be reused and... there is a crisp termination condition that exploits short-sightedness

26 Theoretical properties of FLARES Correctness of labeling procedure Proposition FLARES labels a state s with s.solv true only if all states s 1 that can be reached from following the greedy policy satisfy Rps 1 q ă ɛ. Proposition As long as a Bellman update of s 1 with Rps 1 q ă ɛ never results in Rps 1 q ě ɛ, then FLARES labels a state s with s.d-solv true only if s is depth-t-solved.

27 Theoretical properties of FLARES Correctness of labeling procedure Proposition FLARES labels a state s with s.solv true only if all states s 1 that can be reached from following the greedy policy satisfy Rps 1 q ă ɛ. Proposition As long as a Bellman update of s 1 with Rps 1 q ă ɛ never results in Rps 1 q ě ɛ, then FLARES labels a state s with s.d-solv true only if s is depth-t-solved.

28 Theoretical properties of FLARES FLARES is guaranteed to terminate Theorem If the heuristic is admissible and monotone, FLARES terminates after at most ɛ 1 ř sps V psq V psq trials.

29 Theoretical properties of FLARES An optimal version of FLARES An optimal version of FLARES can be produced by calling FLARES multiple times with increasing horizon t Theorem If the initial heuristic is admissible and monotone, and ρ t, ρpv, tq ą t, with t 0 ě 0, then OPT-FLARES computes an optimal policy.

30 Theoretical properties of FLARES An optimal version of FLARES An optimal version of FLARES can be produced by calling FLARES multiple times with increasing horizon t Theorem If the initial heuristic is admissible and monotone, and ρ t, ρpv, tq ą t, with t 0 ě 0, then OPT-FLARES computes an optimal policy.

31 Gridworld domain Gridworld domain results Goal 2 25 x 100 cells 25 x 100 cells Start Goal 1 algorithm cost time LRTDP FLARES(0) FLARES(1) HDP(0,0) HDP(4,0) HDP(4,4) SSiPP(16) SSiPP(32) SSiPP(64)

32 Racetrack domain Racetrack domain - Average costs

33 Racetrack domain Racetrack domain - Average planning time (seconds) square-4 square-5 ring-5 ring-6 LRTDP FLARES(0) FLARES(1) HDP(0,0) HDP(0,1) HDP(1,0) HDP(1,1) SSiPP(4) SSiPP(8)

34 Sailing domain Sailing domain - Average costs 250 size=(20,20) goal=(10,10) OPT F(0) F(1) H(0,0) H(0,1) H(1,0) H(1,1) S(4) S(8) 600 size=(40,40) goal=(20,20) OPT F(0) F(1) H(0,0) H(0,1) H(1,0) H(1,1) S(4) S(8) 300 size=(20,20) goal=(19,19) OPT F(0) F(1) H(0,0) H(0,1) H(1,0) H(1,1) S(4) S(8) 600 size=(40,40) goal=(39,39) OPT F(0) F(1) H(0,0) H(0,1) H(1,0) H(1,1) S(4) S(8)

35 Sailing domain Sailing domain - Average planning time (seconds) s=20 g=corner s=40 g=corner s=20 g=middle s=40 g=middle LRTDP FLARES(0) FLARES(1) FLARES(2) HDP(0,0) HDP(0,1) HDP(1,0) HDP(1,1) SSiPP(4) SSiPP(8)

36 Conclusions Conclusions We present an approach for short-sightedness in SSPs that applies it only for labeling Allows larger sections of the state to be explored and accelerate running times Based on this idea, we introduce a novel extension of LRTDP called FLARES Experimental results suggest that FLARES can produce near-optimal policies orders of magnitude faster than other state-of-the-art MDP solvers

37 Conclusions Conclusions We present an approach for short-sightedness in SSPs that applies it only for labeling Allows larger sections of the state to be explored and accelerate running times Based on this idea, we introduce a novel extension of LRTDP called FLARES Experimental results suggest that FLARES can produce near-optimal policies orders of magnitude faster than other state-of-the-art MDP solvers

38 Conclusions Conclusions We present an approach for short-sightedness in SSPs that applies it only for labeling Allows larger sections of the state to be explored and accelerate running times Based on this idea, we introduce a novel extension of LRTDP called FLARES Experimental results suggest that FLARES can produce near-optimal policies orders of magnitude faster than other state-of-the-art MDP solvers

39 Conclusions Conclusions We present an approach for short-sightedness in SSPs that applies it only for labeling Allows larger sections of the state to be explored and accelerate running times Based on this idea, we introduce a novel extension of LRTDP called FLARES Experimental results suggest that FLARES can produce near-optimal policies orders of magnitude faster than other state-of-the-art MDP solvers

40 Conclusions Conclusions We present an approach for short-sightedness in SSPs that applies it only for labeling Allows larger sections of the state to be explored and accelerate running times Based on this idea, we introduce a novel extension of LRTDP called FLARES Experimental results suggest that FLARES can produce near-optimal policies orders of magnitude faster than other state-of-the-art MDP solvers

CSE 573. Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming. Slides adapted from Andrey Kolobov and Mausam

CSE 573. Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming. Slides adapted from Andrey Kolobov and Mausam CSE 573 Markov Decision Processes: Heuristic Search & Real-Time Dynamic Programming Slides adapted from Andrey Kolobov and Mausam 1 Stochastic Shortest-Path MDPs: Motivation Assume the agent pays cost