Stochastic Safest and Shortest Path Problems


2 Stochastic Safest and Shortest Path Problems
Florent Teichteil-Königsbuch
AAAI-12, Toronto, Canada, July 24-26, 2012

3 Path optimization under probabilistic uncertainties
Problems amounting to searching for a shortest path in a probabilistic AND/OR cyclic graph:
- OR nodes: branch choice (action)
- AND nodes: probabilistic outcomes of the chosen branch (action effects)
Problem statement: compute a policy that goes to the goal with maximum probability or minimum expected cost-to-go.
Examples:
- Shortest-path planning in probabilistic grid worlds (racetrack)
- Minimum number of block moves to build towers with stochastic operators (exploding-blocksworld)
- Controller synthesis for critical systems, with maximum terminal availability and minimum energy consumption (embedded systems, transportation systems, servers, etc.)

4 Mathematical framework: goal-oriented MDP
Goal-oriented Markov Decision Process:
- $S$: finite set of states
- $G \subseteq S$: finite set of goal states
- $A$: finite set of actions
- $T : S \times A \times S \to [0;1]$: transition function, $T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
- $c : S \times A \times S \to \mathbb{R}$: cost function associated with the transition function
- Absorbing goal states: $\forall g \in G, \forall a \in A,\ T(g, a, g) = 1$
- No costs paid from goal states: $\forall g \in G, \forall a \in A,\ c(g, a, g) = 0$
- $app : S \to 2^A$: applicable actions in a given state
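As a concrete reading of this definition, here is a minimal sketch of the goal-oriented MDP tuple in Python; all names (`GoalMDP`, `check_goal_conventions`, the dict encodings of $T$ and $c$) are hypothetical illustrations, not from the talk.

```python
# A minimal sketch of a goal-oriented MDP (hypothetical encoding:
# states and actions are plain strings).
from dataclasses import dataclass

@dataclass
class GoalMDP:
    states: set    # S: finite set of states
    goals: set     # G, subset of S: absorbing goal states
    app: dict      # app[s]: list of actions applicable in s
    T: dict        # T[(s, a)] -> list of (s', Pr(s' | s, a))
    c: dict        # c[(s, a, s')] -> transition cost

    def check_goal_conventions(self) -> bool:
        """Goals must be absorbing and cost-free: T(g,a,g)=1, c(g,a,g)=0."""
        return all(
            self.T[(g, a)] == [(g, 1.0)] and self.c[(g, a, g)] == 0.0
            for g in self.goals for a in self.app.get(g, [])
        )
```

The two goal conventions (absorbing, zero-cost) are exactly the two assertions checked above; everything later in the talk relies on them.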

5 Stochastic Shortest Path (SSP) [Bertsekas & Tsitsiklis (1996)]
Optimization criterion: total (undiscounted) cost, or cost-to-go.
Find a Markovian policy $\pi : S \to A$ that minimizes the expected total cost from any possible initial state:
$$\forall s \in S,\quad \pi^*(s) = \operatorname*{argmin}_{\pi \in A^S} V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} c_t \,\middle|\, s_0 = s\right]$$
The value of $\pi^*$ is a solution of the Bellman equation:
$$\forall s \in S,\quad V^{\pi^*}(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\left(V^{\pi^*}(s') + c(s, a, s')\right)$$
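The Bellman equation above can be solved by a standard undiscounted value-iteration sweep. The sketch below runs it on a hypothetical one-action, two-state chain (an assumed example, not one from the talk); with success probability 0.5 and unit costs, the expected cost-to-go from `I` is 2.

```python
# Undiscounted value iteration for the SSP Bellman equation:
# V(s) = min_{a in app(s)} sum_{s'} T(s,a,s') * (c(s,a,s') + V(s'))
GOALS = {"G"}
T = {("I", "go"): [("G", 0.5), ("I", 0.5)]}   # T[(s, a)] -> [(s', prob)]
c = {("I", "go", "G"): 1.0, ("I", "go", "I"): 1.0}
app = {"I": ["go"]}                            # goal G has no update

V = {"I": 0.0, "G": 0.0}
for _ in range(1000):
    delta = 0.0
    for s in app:
        best = min(
            sum(p * (c[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
            for a in app[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-10:
        break
# V("I") converges to 2: a geometric number of unit-cost attempts.
```

Note that convergence of this sweep is exactly what Assumptions 1 and 2 on the next slide guarantee; without them the iterates may diverge.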

6 SSP: required theoretical and practical assumptions
Assumption 1: there exists at least one proper policy, i.e. a policy that reaches the goal with probability 1 regardless of the initial state.
Assumption 2: for every improper policy, the corresponding cost-to-go is infinite, i.e. every cycle not leading to the goal is composed of strictly positive costs.
Implications if both assumptions 1 and 2 hold:
- There exists a policy $\pi$ such that $V^\pi$ is finite;
- An optimal Markovian (stationary) policy can be obtained using dynamic programming (Bellman equation).

7 Drawbacks of the SSP criterion
SSP assumptions are not easy to check in practice:
- Deciding whether assumptions 1 and 2 hold is not obvious in general
- Same complexity class as optimizing the cost-to-go criterion
Limited practical scope:
- Limited to optimizing policies that reach the goal with probability 1, without nonpositive-cost cycles not leading to the goal
- Especially annoying in the presence of dead ends or nonpositive-cost cycles
- In the absence of proper policies, there is no known method to optimize both the probability of reaching the goal and the corresponding total cost of the paths to the goal (dual optimization)

9 Alternatives to the standard SSP criterion
Generalized SSP (GSSP) [Kolobov, Mausam & Weld (2011)]:
- Constrained optimization: find the minimal-cost policy among the ones reaching the goal with probability 1
- Includes many other problems, e.g. finding the policy that reaches the goal with maximum probability
- Drawback: assumes proper policies; cannot find minimal-cost policies among the ones reaching the goal with maximum probability
Discounted SSP (DSSP) [Teichteil, Vidal & Infantes (2011)]:
- Relaxed heuristics for the $\gamma$-discounted criterion:
$$\forall s \in S,\quad V^\pi(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\left(\gamma V^\pi(s') + c(s, a, s')\right)$$
- Drawback: accumulates costs along paths not reaching the goal, so the optimized policy may potentially avoid the goal...
fSSPUDE / iSSPUDE [Kolobov, Mausam & Weld (2012)]:
- fSSPUDE: goal MDPs with finite-cost unavoidable dead ends; no dual optimization of goal-probability and goal-cost
- iSSPUDE: goal MDPs with infinite-cost unavoidable dead ends (required); dual optimization, limited to positive costs

10 S³P: a new dual optimization criterion for goal MDPs (SSP ⊂ GSSP ⊂ S³P)
- Goal-probability function $P_n^{G,\pi}(s)$: probability of reaching the goal in at most $n$ steps-to-go by executing $\pi$ from $s$
- Goal-cost function $C_n^{G,\pi}(s)$: expected total cost in at most $n$ steps-to-go by executing $\pi$ from $s$, averaged only over paths to the goal
S³P optimization criterion: find an optimal (Markovian) policy $\pi^*$ that minimizes $C^{G,\pi}$ among all policies maximizing $P^{G,\pi}$:
$$\pi^*(s) \in \operatorname*{argmin}_{\pi \,:\, \forall s' \in S,\ \pi(s') \in \operatorname*{argmax}_{\pi' \in A^S} P^{G,\pi'}(s')} C^{G,\pi}(s)$$

15 Example I: goal-oriented MDP with a proper policy
[Figure: states $I$ and $s$, goal $G$; actions $a_I$, $a_1$, $a_2$, $a_G$; costs $c = 2$ and $c = 1$ shown on the transitions]
Two policies: $\pi_1 = (a_I, a_1, a_G, a_G, a_G, \dots)$, $\pi_2 = (a_I, a_2, a_I, a_2, a_I, a_2, \dots)$
From $I$:
- $V^\pi(I)$ [SSP]: 2
- $P^{G,\pi}$ [S³P]: $\pi_1 = 1$, $\pi_2 = 0$
- $C^{G,\pi}$ [S³P]: $\pi_1 = 2$, $\pi_2 = 0$
$\pi_1$ is the optimal S³P policy.
Assumption 2 is not satisfied, so the SSP criterion is not well-defined!

22 Example II: goal-oriented MDP without a proper policy
[Figure: states $I$ and $s$, goal $G$, dead end $d$; actions $a_I$, $a_1$, $a_2$, $a_3$, $a_G$, $a_d$; two transitions with cost $c = 2$]
Four Markovian policies depending on the action chosen in $I$: $a_1 \to \pi_1$, $a_2 \to \pi_2$, $a_3 \to \pi_3$, $a_I \to \pi_4$
Optimal SSP policy: $\pi_3$ (SSP well-defined, but its assumptions are unsatisfied!)
From state $I$: [table of $V$, $P^{G,\pi}$ and $C^{G,\pi}$ values for $\pi_1$ through $\pi_4$]
Optimal S³P policy: $\pi_1$

30 Example III: goal-oriented MDP without a proper policy
[Figure: from $I$, actions $a_1$, $a_2$, $a_3$ lead through states $s_1$ ($c{=}1$), $s_2$ ($c{=}10$), $s_3$ ($c{=}100$) toward goal $G$, each risking a dead end $d_1$, $d_2$, $d_3$]
Assumptions 1 and 2 are both unsatisfied: $\forall i \in \{1, 2, 3, 4\},\ V^{\pi_i}(I) = +\infty$
$\gamma$-discounted criterion: $\forall\, 0 < \gamma < 1,\ V_\gamma^{\pi_4}(I) > V_\gamma^{\pi_3}(I) > V_\gamma^{\pi_2}(I) > V_\gamma^{\pi_1}(I)$, so $\pi_1$ is the only optimal DSSP policy for all $0 < \gamma < 1$
S³P criterion: $P^{G,\pi_1}(I) = 0.55$, $P^{G,\pi_2}(I) = 0.95$, $P^{G,\pi_3}(I) = 0.95$, $P^{G,\pi_4}(I) = 0$, $C^{G,\pi_4}(I) = 0$; $\pi_2$ is the only optimal S³P policy!

35 Lessons learnt from these examples
If you care about: first reaching the goal, then minimizing costs along paths to the goal, as in deterministic/classical planning,
then be aware that: previously existing criteria do not guarantee to produce such policies,
without mentioning that: standard criteria need not even be well-defined for many interesting goal-oriented MDPs, contrary to S³Ps.
So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S³Ps.

37 Policy evaluation for the S³P dual criterion
Theorem 1 (finite horizon $H \in \mathbb{N}$, general costs). For all steps-to-go $1 \le n < H$, all history-dependent policies $\pi = (\pi_0, \dots, \pi_{H-1})$ and all states $s \in S$:
$$P_n^{G,\pi}(s) = \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s'), \quad \text{with } P_0^{G,\pi}(s) = 0,\ \forall s \in S \setminus G,\ \text{and } P_0^{G,\pi}(g) = 1,\ \forall g \in G \quad (1)$$
If $P_n^{G,\pi}(s) > 0$, then $C_n^{G,\pi}(s)$ is well-defined and satisfies:
$$C_n^{G,\pi}(s) = \frac{1}{P_n^{G,\pi}(s)} \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s') \left[c(s, \pi_{H-n}(s), s') + C_{n-1}^{G,\pi}(s')\right], \quad \text{with } C_0^{G,\pi}(s) = 0,\ \forall s \in S \quad (2)$$
The factor $1/P_n^{G,\pi}(s)$ is a normalization factor for averaging only over paths to the goal.
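The recursions (1) and (2) can be evaluated directly for a fixed stationary policy. Below is a sketch on a hypothetical three-state chain with a dead end (the state names and dict encodings are assumptions for illustration): from `I`, the policy reaches `G` with probability 0.8 at cost 1, and the dead end `D` otherwise.

```python
# Theorem-1 recursions for a fixed stationary policy pi:
# P_n(s) = sum_s' T_pi(s,s') P_{n-1}(s')
# C_n(s) = (1/P_n(s)) sum_s' T_pi(s,s') P_{n-1}(s') [c(s,s') + C_{n-1}(s')]
GOALS = {"G"}
pi_T = {"I": [("G", 0.8), ("D", 0.2)], "D": [("D", 1.0)]}
cost = {("I", "G"): 1.0, ("I", "D"): 1.0, ("D", "D"): 0.0}

P = {s: (1.0 if s in GOALS else 0.0) for s in ["I", "D", "G"]}  # P_0
C = {s: 0.0 for s in ["I", "D", "G"]}                            # C_0
for n in range(50):
    newP, newC = dict(P), dict(C)
    for s, outcomes in pi_T.items():
        newP[s] = sum(p * P[s2] for s2, p in outcomes)
        if newP[s] > 0:  # average costs only over paths reaching the goal
            newC[s] = sum(
                p * P[s2] * (cost[(s, s2)] + C[s2]) for s2, p in outcomes
            ) / newP[s]
    P, C = newP, newC
# P("I") converges to 0.8, while C("I") converges to 1.0: even though the
# expected total cost over all paths would mix in the dead end, the
# goal-cost averages only over the goal-reaching paths, which all cost 1.
```

This is exactly the point of the normalization factor: `C` stays finite and meaningful even though a classical cost-to-go from `I` would be contaminated by the dead end.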

38 Convergence of the S³P dual criterion in infinite horizon
A fundamental difference with SSPs: demonstrate and use the contraction property of the transition function over the states reaching the goal with some positive probability (for any Markovian policy, in infinite horizon, with general costs).
Lemma 1 (infinite horizon, general costs; generalizes/extends Bertsekas & Tsitsiklis' SSP theoretical results to S³Ps). Let $M$ be a general goal-oriented MDP, $\pi$ a stationary policy, $T^\pi$ the transition matrix for policy $\pi$, and for all $n \in \mathbb{N}$, $X_n^\pi = \{s \in S \setminus G : P_n^{G,\pi}(s) > 0\}$. Then:
(i) for all $s \in S$, $P_n^{G,\pi}(s)$ converges to a finite value as $n$ tends to $+\infty$;
(ii) there exists $X^\pi \subseteq S$ such that $X_n^\pi \subseteq X^\pi$ for all $n \in \mathbb{N}$, and $T^\pi$ is a contraction over $X^\pi$.
This new contraction property guarantees the well-foundedness of the S³P criterion and the existence of optimal Markovian policies, without any assumption at all!

40 Evaluating and optimizing S³P policies in infinite horizon
The S³P dual criterion is well-defined for any goal-oriented MDP and any Markovian policy.
Theorem 2 (infinite horizon, general costs). Let $M$ be any general goal-oriented MDP and $\pi$ any stationary policy for $M$. The evaluation equations of Theorem 1 converge to finite values $P^{G,\pi}(s)$ and $C^{G,\pi}(s)$ for any $s \in S$ (by convention, $C_n^{G,\pi}(s) = 0$ if $P_n^{G,\pi}(s) = 0$, for all $n \in \mathbb{N}$).
There exists a Markovian policy solution to the S³P problem for any goal-oriented MDP.
Proposition (infinite horizon, general costs). Let $M$ be any general goal-oriented MDP.
1. There exists an optimal stationary policy $\pi^*$ that minimizes the infinite-horizon goal-cost function among all policies that maximize the infinite-horizon goal-probability function, i.e. $\pi^*$ is S³P-optimal.
2. The goal-probability $P^{G,\pi^*}$ and goal-cost $C^{G,\pi^*}$ functions have finite values.

41 But wait... Can we easily optimize S³Ps?
Greedy Markovian policies w.r.t. the goal-probability or goal-cost metrics need not be optimal!
[Figure: states $I$ and $s$, goal $G$; actions $a_I$, $a_1$, $a_2$, $a_G$; costs $c = 2$, $c = 1$]
$$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s'), \quad \text{with } P_0^*(s) = 0,\ \forall s \in S \setminus G;\ P_0^*(g) = 1,\ \forall g \in G$$
After 3 iterations: $a_2 \in \operatorname*{argmax}_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, \underbrace{P_2^*(s')}_{=1}$
Whereas: $P^{G,\pi=(a_2, a_2, a_2, \dots)}(s) = 0 < 1 = P^*(s)$!

43 But wait... Can we easily optimize S³Ps?
Greedy Markovian policies w.r.t. the goal-probability or goal-cost metrics need not be optimal!
[Figure: same MDP, with costs $c = 3$ and $c = 2$]
Solution: implicitly eliminate non-optimal greedy policies w.r.t. the goal-probability criterion by searching for greedy policies w.r.t. the goal-cost criterion...
...provided all costs are positive (except from the goal), so that choosing $a_2$ has an infinite cost (since $\underbrace{P^{G,\pi_{a_2}}(s)}_{=0} < \underbrace{T(s, a_2, I)\, P^*(I)}_{=1}$).

45 Optimization of S³P Markovian policies in infinite horizon
Under positive costs, greedy policies w.r.t. the goal-probability and goal-cost functions are S³P-optimal.
Theorem 3 (infinite horizon, positive costs only). Let $M$ be a goal-oriented MDP such that all transitions from non-goal states have strictly positive costs (the only required assumption).
$$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s'), \quad \text{with } P_0^*(s) = 0,\ \forall s \in S \setminus G;\ P_0^*(g) = 1,\ \forall g \in G \quad (3)$$
The functions $P_n^*$ converge to a finite-valued function $P^*$. $C_n^*(s) = 0$ if $P^*(s) = 0$; otherwise, if $P^*(s) > 0$:
$$C_n^*(s) = \min_{\substack{a \in app(s) \,:\\ \sum_{s' \in S} T(s,a,s')\, P^*(s') = P^*(s)}} \frac{1}{P^*(s)} \sum_{s' \in S} T(s, a, s')\, P^*(s') \left[c(s, a, s') + C_{n-1}^*(s')\right], \quad \text{with } C_0^*(s) = 0,\ \forall s \in S \quad (4)$$
The functions $C_n^*$ converge to a finite-valued function $C^*$, and any Markovian policy $\pi^*$ obtained from the previous equations at convergence is S³P-optimal.
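Equations (3) and (4) suggest a two-step dynamic-programming scheme, which is the idea behind GPCI: first iterate the goal-probability fixed point $P^*$, then iterate goal-costs restricted to actions greedy w.r.t. $P^*$. The sketch below is a naive fixed-iteration version on a hypothetical example (two actions with the same goal-probability but different costs), not the talk's actual GPCI implementation.

```python
# Two-step scheme of Theorem 3 (assumed naming), strictly positive costs:
# step 1 iterates P_n (Eq. 3); step 2 iterates C_n (Eq. 4) over actions
# that are greedy w.r.t. P*.
GOALS = {"G"}
states = ["I", "D", "G"]
app = {"I": ["a1", "a2"]}                 # D is a dead end: no actions listed
T = {("I", "a1"): [("G", 0.9), ("D", 0.1)],
     ("I", "a2"): [("G", 0.9), ("D", 0.1)]}
c = {("I", "a1", "G"): 5.0, ("I", "a1", "D"): 5.0,
     ("I", "a2", "G"): 1.0, ("I", "a2", "D"): 1.0}

# Step 1: goal-probability iteration (Eq. 3)
P = {s: (1.0 if s in GOALS else 0.0) for s in states}
for _ in range(100):
    for s in app:
        P[s] = max(sum(p * P[s2] for s2, p in T[(s, a)]) for a in app[s])

# Step 2: goal-cost iteration (Eq. 4) over P*-greedy actions only
C = {s: 0.0 for s in states}
for _ in range(100):
    for s in app:
        if P[s] == 0:
            continue
        greedy = [a for a in app[s]
                  if abs(sum(p * P[s2] for s2, p in T[(s, a)]) - P[s]) < 1e-12]
        C[s] = min(
            sum(p * P[s2] * (c[(s, a, s2)] + C[s2])
                for s2, p in T[(s, a)]) / P[s]
            for a in greedy
        )
# Both actions reach G with probability 0.9, so both are P*-greedy;
# step 2 then prefers the cheaper one, giving C("I") = 1.0.
```

Restricting the min in step 2 to `greedy` is the key point: it is the dynamic-programming counterpart of "minimize goal-cost among goal-probability maximizers", and it is sound here precisely because all non-goal costs are strictly positive.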

46 Summarizing the new S³P dual optimization criterion (SSP ⊂ GSSP ⊂ S³P)
- Dual optimization: total-cost minimization, averaged only over paths to the goal, among all Markovian policies maximizing the probability of reaching the goal
- Well-defined in finite or infinite horizon for any goal-oriented MDP (contrary to SSPs or GSSPs)
- But (at the moment): optimization equations in the form of dynamic programming only if all costs from non-goal states are positive
- GPCI algorithm (Goal-Probability and -Cost Iteration)
- iSSPUDE sub-model ($\hat{=}$ S³P with positive costs): efficient heuristic algorithms by Kolobov, Mausam & Weld (UAI 2012)

47 Experimental setup
Tested problems:
- Without dead ends (SSP assumptions 1 and 2 satisfied): blocksworld, rectangle-tireworld. Optimization with the standard SSP criterion, then comparison with the S³P dual criterion.
- With dead ends (SSP assumption 1 or 2 unsatisfied): exploding-blocksworld, triangle-tireworld, grid (a gridworld variation of Example III). Optimization with the DSSP criterion for many values of $\gamma$, until maximizing $P^{G,\pi}$ (resp. minimizing $C^{G,\pi}$) at best; $\gamma_{opt}$ is unknown in advance!
Tested algorithms: VI, LRTDP (optimal for (D)SSPs), RFF, GPCI (optimal for S³Ps).
Once optimized, all policies are evaluated using the S³P criterion, with systematic comparison against an optimal policy for S³Ps.

48 Analysis of the goal-probability function $P^{G,\pi}$
[Bar chart: goal probability of the GPCI, VI, LRTDP and RFF policies ($\gamma \approx 0.99$) over domains BW, RTW, TTW, EBW, G1, G2; GPCI is $P^G$-optimal]
DSSP-optimal policies (VI, LRTDP) do not maximize the probability of reaching the goal, whatever the value of $\gamma$!

49 Analysis of the goal-cost function $C^{G,\pi}$
[Bar chart: goal cost of the GPCI, VI, LRTDP and RFF policies ($\gamma \approx 0.99$) over domains BW, RTW, TTW, EBW, G1, G2; GPCI is $C^G$-optimal among $P^G$-maximizers, i.e. S³P-optimal]
DSSP-optimal policies (VI, LRTDP) do not minimize the total cost averaged only over paths to the goal, whatever $\gamma$! VI and LRTDP sometimes achieve smaller goal-costs, but with actually smaller goal-probabilities too!

50 Comparison of computation times
[Bar chart: computation time in seconds for GPCI, VI, LRTDP and RFF over domains BW, RTW, TTW, EBW, G1, G2]
- GPCI is as efficient as VI on problems that are not really probabilistically interesting ($P^{G,*} \approx 1$)
- GPCI is faster than VI, LRTDP and even RFF on problems with dead ends and a complex cost structure

51 Conclusion and perspectives
An original, well-founded dual criterion for goal-oriented MDPs (SSP ⊂ GSSP ⊂ S³P):
- S³P dual criterion: minimum goal-cost policy among the ones with maximum goal-probability
- Well-defined in infinite horizon for any goal-oriented MDP (no assumptions required, contrary to SSPs or GSSPs)
- If costs are positive: GPCI algorithm, or heuristic algorithms for the iSSPUDE sub-model [Kolobov, Mausam & Weld, 2012]
Future work:
- Unifying our general-cost model with the positive-cost model of Kolobov, Mausam & Weld
- Algorithms for solving S³Ps with general costs
- Domain-independent heuristics for estimating the goal-probability and goal-cost functions

52 Thank you for your attention!
If you care about: first reaching the goal, then minimizing costs along paths to the goal, as in deterministic/classical planning,
then be aware that: previously existing criteria do not guarantee to produce such policies,
without mentioning that: standard criteria need not even be well-defined for many interesting goal-oriented MDPs, contrary to S³Ps.
So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S³Ps.

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Probabilistic Planning. George Konidaris

Probabilistic Planning. George Konidaris Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t

More information

Multiagent Value Iteration in Markov Games

Multiagent Value Iteration in Markov Games Multiagent Value Iteration in Markov Games Amy Greenwald Brown University with Michael Littman and Martin Zinkevich Stony Brook Game Theory Festival July 21, 2005 Agenda Theorem Value iteration converges

More information

2534 Lecture 4: Sequential Decisions and Markov Decision Processes

2534 Lecture 4: Sequential Decisions and Markov Decision Processes 2534 Lecture 4: Sequential Decisions and Markov Decision Processes Briefly: preference elicitation (last week s readings) Utility Elicitation as a Classification Problem. Chajewska, U., L. Getoor, J. Norman,Y.

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

AM 121: Intro to Optimization Models and Methods: Fall 2018

AM 121: Intro to Optimization Models and Methods: Fall 2018 AM 11: Intro to Optimization Models and Methods: Fall 018 Lecture 18: Markov Decision Processes Yiling Chen Lesson Plan Markov decision processes Policies and value functions Solving: average reward, discounted

More information

On the Policy Iteration algorithm for PageRank Optimization

On the Policy Iteration algorithm for PageRank Optimization Université Catholique de Louvain École Polytechnique de Louvain Pôle d Ingénierie Mathématique (INMA) and Massachusett s Institute of Technology Laboratory for Information and Decision Systems Master s

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Discrete planning (an introduction)

Discrete planning (an introduction) Sistemi Intelligenti Corso di Laurea in Informatica, A.A. 2017-2018 Università degli Studi di Milano Discrete planning (an introduction) Nicola Basilico Dipartimento di Informatica Via Comelico 39/41-20135

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications

Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications May 2012 Report LIDS - 2884 Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications Dimitri P. Bertsekas Abstract We consider a class of generalized dynamic programming

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Elements of Reinforcement Learning

Elements of Reinforcement Learning Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

The Role of Discount Factor in Risk Sensitive Markov Decision Processes

The Role of Discount Factor in Risk Sensitive Markov Decision Processes 06 5th Brazilian Conference on Intelligent Systems The Role of Discount Factor in Risk Sensitive Markov Decision Processes Valdinei Freire Escola de Artes, Ciências e Humanidades Universidade de São Paulo

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

The quest for finding Hamiltonian cycles

The quest for finding Hamiltonian cycles The quest for finding Hamiltonian cycles Giang Nguyen School of Mathematical Sciences University of Adelaide Travelling Salesman Problem Given a list of cities and distances between cities, what is the

More information

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018 Section Notes 9 Midterm 2 Review Applied Math / Engineering Sciences 121 Week of December 3, 2018 The following list of topics is an overview of the material that was covered in the lectures and sections

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Regular Policies in Abstract Dynamic Programming

Regular Policies in Abstract Dynamic Programming August 2016 (Revised January 2017) Report LIDS-P-3173 Regular Policies in Abstract Dynamic Programming Dimitri P. Bertsekas Abstract We consider challenging dynamic programming models where the associated

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Homework 2: MDPs and Search

Homework 2: MDPs and Search Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles

More information

1 MDP Value Iteration Algorithm

1 MDP Value Iteration Algorithm CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using

More information

6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE 6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE Undiscounted problems Stochastic shortest path problems (SSP) Proper and improper policies Analysis and computational methods for SSP Pathologies of

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

1 [15 points] Search Strategies

1 [15 points] Search Strategies Probabilistic Foundations of Artificial Intelligence Final Exam Date: 29 January 2013 Time limit: 120 minutes Number of pages: 12 You can use the back of the pages if you run out of space. strictly forbidden.

More information

Motivation for introducing probabilities

Motivation for introducing probabilities for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Markov Decision Processes (and a small amount of reinforcement learning)

Markov Decision Processes (and a small amount of reinforcement learning) Markov Decision Processes (and a small amount of reinforcement learning) Slides adapted from: Brian Williams, MIT Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU Nicholas Roy 16.4/13 Session

More information

Abstract Dynamic Programming

Abstract Dynamic Programming Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Module 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

Module 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Module 8 Linear Programming CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Policy Optimization Value and policy iteration Iterative algorithms that implicitly solve

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

CS360 Homework 12 Solution

CS360 Homework 12 Solution CS360 Homework 12 Solution Constraint Satisfaction 1) Consider the following constraint satisfaction problem with variables x, y and z, each with domain {1, 2, 3}, and constraints C 1 and C 2, defined

More information

RECURSION EQUATION FOR

RECURSION EQUATION FOR Math 46 Lecture 8 Infinite Horizon discounted reward problem From the last lecture: The value function of policy u for the infinite horizon problem with discount factor a and initial state i is W i, u

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

The Markov Decision Process (MDP) model

The Markov Decision Process (MDP) model Decision Making in Robots and Autonomous Agents The Markov Decision Process (MDP) model Subramanian Ramamoorthy School of Informatics 25 January, 2013 In the MAB Model We were in a single casino and the

More information

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Andrew Mastin and Patrick Jaillet Abstract We analyze losses resulting from uncertain transition probabilities in Markov

More information

Central-limit approach to risk-aware Markov decision processes

Central-limit approach to risk-aware Markov decision processes Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory

More information

Infinite-Horizon Discounted Markov Decision Processes

Infinite-Horizon Discounted Markov Decision Processes Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected

More information

Occupation Measure Heuristics for Probabilistic Planning

Occupation Measure Heuristics for Probabilistic Planning Occupation Measure Heuristics for Probabilistic Planning Felipe Trevizan, Sylvie Thiébaux and Patrik Haslum Data61, CSIRO and Research School of Computer Science, ANU Canberra, ACT, Australia first.last@anu.edu.au

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

Reading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

Reading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Reading Response: Due Wednesday R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Another Example Get to the top of the hill as quickly as possible. reward = 1 for each step where

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

Introduction to Reinforcement Learning Part 1: Markov Decision Processes

Introduction to Reinforcement Learning Part 1: Markov Decision Processes Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

6 Basic Convergence Results for RL Algorithms

6 Basic Convergence Results for RL Algorithms Learning in Complex Systems Spring 2011 Lecture Notes Nahum Shimkin 6 Basic Convergence Results for RL Algorithms We establish here some asymptotic convergence results for the basic RL algorithms, by showing

More information