Stochastic Safest and Shortest Path Problems


2 Stochastic Safest and Shortest Path Problems
Florent Teichteil-Königsbuch
AAAI-12, Toronto, Canada, July 24-26, 2012

3 Path optimization under probabilistic uncertainties
Problems amounting to searching for a shortest path in a probabilistic AND/OR cyclic graph:
- OR nodes: branch choice (action)
- AND nodes: probabilistic outcomes of the chosen branch (action effects)
Problem statement: compute a policy that goes to the goal with maximum probability or minimum expected cost-to-go.
Examples:
- Shortest-path planning in probabilistic grid worlds (racetrack)
- Minimum number of block moves to build towers with stochastic operators (exploding-blocksworld)
- Controller synthesis for critical systems, with maximum terminal availability and minimum energy consumption (embedded systems, transportation systems, servers, etc.)

4 Mathematical framework: goal-oriented MDP
Goal-oriented Markov Decision Process:
- $S$: finite set of states
- $G \subseteq S$: finite set of goal states
- $A$: finite set of actions
- $T : S \times A \times S \to [0;1]$: transition function, $T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
- $c : S \times A \times S \to \mathbb{R}$: cost function associated with the transition function
- Absorbing goal states: $\forall g \in G, \forall a \in A,\ T(g, a, g) = 1$
- No costs paid from goal states: $\forall g \in G, \forall a \in A,\ c(g, a, g) = 0$
- $app : S \to 2^A$: applicable actions in a given state
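As a concrete reading of this definition, here is a minimal sketch of the goal-oriented MDP tuple in Python; all names (`GoalMDP`, `check_goal_conventions`, the dict encodings of $T$ and $c$) are hypothetical illustrations, not from the talk.

```python
# A minimal sketch of a goal-oriented MDP (hypothetical encoding:
# states and actions are plain strings).
from dataclasses import dataclass

@dataclass
class GoalMDP:
    states: set    # S: finite set of states
    goals: set     # G, subset of S: absorbing goal states
    app: dict      # app[s]: list of actions applicable in s
    T: dict        # T[(s, a)] -> list of (s', Pr(s' | s, a))
    c: dict        # c[(s, a, s')] -> transition cost

    def check_goal_conventions(self) -> bool:
        """Goals must be absorbing and cost-free: T(g,a,g)=1, c(g,a,g)=0."""
        return all(
            self.T[(g, a)] == [(g, 1.0)] and self.c[(g, a, g)] == 0.0
            for g in self.goals for a in self.app.get(g, [])
        )
```

The two goal conventions (absorbing, zero-cost) are exactly the two assertions checked above; everything later in the talk relies on them.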

5 Stochastic Shortest Path (SSP) [Bertsekas & Tsitsiklis (1996)]
Optimization criterion: total (undiscounted) cost, or cost-to-go.
Find a Markovian policy $\pi : S \to A$ that minimizes the expected total cost from any possible initial state:
$$\forall s \in S,\quad \pi^*(s) = \operatorname*{argmin}_{\pi \in A^S} V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} c_t \,\middle|\, s_0 = s\right]$$
The value of $\pi^*$ is a solution of the Bellman equation:
$$\forall s \in S,\quad V^{\pi^*}(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\left(V^{\pi^*}(s') + c(s, a, s')\right)$$
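The Bellman equation above can be solved by a standard undiscounted value-iteration sweep. The sketch below runs it on a hypothetical one-action, two-state chain (an assumed example, not one from the talk); with success probability 0.5 and unit costs, the expected cost-to-go from `I` is 2.

```python
# Undiscounted value iteration for the SSP Bellman equation:
# V(s) = min_{a in app(s)} sum_{s'} T(s,a,s') * (c(s,a,s') + V(s'))
GOALS = {"G"}
T = {("I", "go"): [("G", 0.5), ("I", 0.5)]}   # T[(s, a)] -> [(s', prob)]
c = {("I", "go", "G"): 1.0, ("I", "go", "I"): 1.0}
app = {"I": ["go"]}                            # goal G has no update

V = {"I": 0.0, "G": 0.0}
for _ in range(1000):
    delta = 0.0
    for s in app:
        best = min(
            sum(p * (c[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)])
            for a in app[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-10:
        break
# V("I") converges to 2: a geometric number of unit-cost attempts.
```

Note that convergence of this sweep is exactly what Assumptions 1 and 2 on the next slide guarantee; without them the iterates may diverge.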

6 SSP: required theoretical and practical assumptions
Assumption 1: there exists at least one proper policy, i.e. a policy that reaches the goal with probability 1 regardless of the initial state.
Assumption 2: for every improper policy, the corresponding cost-to-go is infinite, i.e. every cycle not leading to the goal is composed of strictly positive costs.
Implications if both assumptions 1 and 2 hold:
- There exists a policy $\pi$ such that $V^\pi$ is finite;
- An optimal Markovian (stationary) policy can be obtained using dynamic programming (Bellman equation).

7 Drawbacks of the SSP criterion
SSP assumptions are not easy to check in practice:
- Deciding whether assumptions 1 and 2 hold is not obvious in general
- Same complexity class as optimizing the cost-to-go criterion
Limited practical scope:
- Limited to optimizing policies that reach the goal with probability 1, without nonpositive-cost cycles not leading to the goal
- Especially annoying in the presence of dead ends or nonpositive-cost cycles
- In the absence of proper policies, there is no known method to optimize both the probability of reaching the goal and the corresponding total cost of the paths to the goal (dual optimization)

9 Alternatives to the standard SSP criterion
Generalized SSP (GSSP) [Kolobov, Mausam & Weld (2011)]:
- Constrained optimization: find the minimal-cost policy among the ones reaching the goal with probability 1
- Includes many other problems, e.g. finding the policy that reaches the goal with maximum probability
- Drawback: assumes proper policies; cannot find minimal-cost policies among the ones reaching the goal with maximum probability
Discounted SSP (DSSP) [Teichteil, Vidal & Infantes (2011)]:
- Relaxed heuristics for the $\gamma$-discounted criterion:
$$\forall s \in S,\quad V^\pi(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\left(\gamma V^\pi(s') + c(s, a, s')\right)$$
- Drawback: accumulates costs along paths not reaching the goal, so the optimized policy may potentially avoid the goal...
fSSPUDE / iSSPUDE [Kolobov, Mausam & Weld (2012)]:
- fSSPUDE: goal MDPs with finite-cost unavoidable dead ends; no dual optimization of goal-probability and goal-cost
- iSSPUDE: goal MDPs with infinite-cost unavoidable dead ends (required); dual optimization, limited to positive costs

10 S³P: a new dual optimization criterion for goal MDPs (SSP ⊂ GSSP ⊂ S³P)
- Goal-probability function $P_n^{G,\pi}(s)$: probability of reaching the goal in at most $n$ steps-to-go by executing $\pi$ from $s$
- Goal-cost function $C_n^{G,\pi}(s)$: expected total cost in at most $n$ steps-to-go by executing $\pi$ from $s$, averaged only over paths to the goal
S³P optimization criterion: find an optimal (Markovian) policy $\pi^*$ that minimizes $C^{G,\pi}$ among all policies maximizing $P^{G,\pi}$:
$$\pi^*(s) \in \operatorname*{argmin}_{\pi \,:\, \forall s' \in S,\ \pi(s') \in \operatorname*{argmax}_{\pi' \in A^S} P^{G,\pi'}(s')} C^{G,\pi}(s)$$

15 Example I: goal-oriented MDP with a proper policy
[Figure: states $I$ and $s$, goal $G$; actions $a_I$, $a_1$, $a_2$, $a_G$; costs $c = 2$ and $c = 1$ shown on the transitions]
Two policies: $\pi_1 = (a_I, a_1, a_G, a_G, a_G, \dots)$, $\pi_2 = (a_I, a_2, a_I, a_2, a_I, a_2, \dots)$
From $I$:
- $V^\pi(I)$ [SSP]: 2
- $P^{G,\pi}$ [S³P]: $\pi_1 = 1$, $\pi_2 = 0$
- $C^{G,\pi}$ [S³P]: $\pi_1 = 2$, $\pi_2 = 0$
$\pi_1$ is the optimal S³P policy.
Assumption 2 is not satisfied, so the SSP criterion is not well-defined!

22 Example II: goal-oriented MDP without a proper policy
[Figure: states $I$ and $s$, goal $G$, dead end $d$; actions $a_I$, $a_1$, $a_2$, $a_3$, $a_G$, $a_d$; two transitions with cost $c = 2$]
Four Markovian policies depending on the action chosen in $I$: $a_1 \to \pi_1$, $a_2 \to \pi_2$, $a_3 \to \pi_3$, $a_I \to \pi_4$
Optimal SSP policy: $\pi_3$ (SSP well-defined, but its assumptions are unsatisfied!)
From state $I$: [table of $V$, $P^{G,\pi}$ and $C^{G,\pi}$ values for $\pi_1$ through $\pi_4$]
Optimal S³P policy: $\pi_1$

30 Example III: goal-oriented MDP without a proper policy
[Figure: from $I$, actions $a_1$, $a_2$, $a_3$ lead through states $s_1$ ($c{=}1$), $s_2$ ($c{=}10$), $s_3$ ($c{=}100$) toward goal $G$, each risking a dead end $d_1$, $d_2$, $d_3$]
Assumptions 1 and 2 are both unsatisfied: $\forall i \in \{1, 2, 3, 4\},\ V^{\pi_i}(I) = +\infty$
$\gamma$-discounted criterion: $\forall\, 0 < \gamma < 1,\ V_\gamma^{\pi_4}(I) > V_\gamma^{\pi_3}(I) > V_\gamma^{\pi_2}(I) > V_\gamma^{\pi_1}(I)$, so $\pi_1$ is the only optimal DSSP policy for all $0 < \gamma < 1$
S³P criterion: $P^{G,\pi_1}(I) = 0.55$, $P^{G,\pi_2}(I) = 0.95$, $P^{G,\pi_3}(I) = 0.95$, $P^{G,\pi_4}(I) = 0$, $C^{G,\pi_4}(I) = 0$; $\pi_2$ is the only optimal S³P policy!

35 Lessons learnt from these examples
If you care about: first reaching the goal, then minimizing costs along paths to the goal, as in deterministic/classical planning,
then be aware that: previously existing criteria do not guarantee to produce such policies,
without mentioning that: standard criteria need not even be well-defined for many interesting goal-oriented MDPs, contrary to S³Ps.
So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S³Ps.

37 Policy evaluation for the S³P dual criterion
Theorem 1 (finite horizon $H \in \mathbb{N}$, general costs). For all steps-to-go $1 \le n < H$, all history-dependent policies $\pi = (\pi_0, \dots, \pi_{H-1})$ and all states $s \in S$:
$$P_n^{G,\pi}(s) = \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s'), \quad \text{with } P_0^{G,\pi}(s) = 0,\ \forall s \in S \setminus G,\ \text{and } P_0^{G,\pi}(g) = 1,\ \forall g \in G \quad (1)$$
If $P_n^{G,\pi}(s) > 0$, then $C_n^{G,\pi}(s)$ is well-defined and satisfies:
$$C_n^{G,\pi}(s) = \frac{1}{P_n^{G,\pi}(s)} \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s') \left[c(s, \pi_{H-n}(s), s') + C_{n-1}^{G,\pi}(s')\right], \quad \text{with } C_0^{G,\pi}(s) = 0,\ \forall s \in S \quad (2)$$
The factor $1/P_n^{G,\pi}(s)$ is a normalization factor for averaging only over paths to the goal.
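The recursions (1) and (2) can be evaluated directly for a fixed stationary policy. Below is a sketch on a hypothetical three-state chain with a dead end (the state names and dict encodings are assumptions for illustration): from `I`, the policy reaches `G` with probability 0.8 at cost 1, and the dead end `D` otherwise.

```python
# Theorem-1 recursions for a fixed stationary policy pi:
# P_n(s) = sum_s' T_pi(s,s') P_{n-1}(s')
# C_n(s) = (1/P_n(s)) sum_s' T_pi(s,s') P_{n-1}(s') [c(s,s') + C_{n-1}(s')]
GOALS = {"G"}
pi_T = {"I": [("G", 0.8), ("D", 0.2)], "D": [("D", 1.0)]}
cost = {("I", "G"): 1.0, ("I", "D"): 1.0, ("D", "D"): 0.0}

P = {s: (1.0 if s in GOALS else 0.0) for s in ["I", "D", "G"]}  # P_0
C = {s: 0.0 for s in ["I", "D", "G"]}                            # C_0
for n in range(50):
    newP, newC = dict(P), dict(C)
    for s, outcomes in pi_T.items():
        newP[s] = sum(p * P[s2] for s2, p in outcomes)
        if newP[s] > 0:  # average costs only over paths reaching the goal
            newC[s] = sum(
                p * P[s2] * (cost[(s, s2)] + C[s2]) for s2, p in outcomes
            ) / newP[s]
    P, C = newP, newC
# P("I") converges to 0.8, while C("I") converges to 1.0: even though the
# expected total cost over all paths would mix in the dead end, the
# goal-cost averages only over the goal-reaching paths, which all cost 1.
```

This is exactly the point of the normalization factor: `C` stays finite and meaningful even though a classical cost-to-go from `I` would be contaminated by the dead end.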

38 Convergence of the S³P dual criterion in infinite horizon
A fundamental difference with SSPs: demonstrate and use the contraction property of the transition function over the states reaching the goal with some positive probability (for any Markovian policy, in infinite horizon, with general costs).
Lemma 1 (infinite horizon, general costs; generalizes/extends Bertsekas & Tsitsiklis' SSP theoretical results to S³Ps). Let $M$ be a general goal-oriented MDP, $\pi$ a stationary policy, $T^\pi$ the transition matrix for policy $\pi$, and for all $n \in \mathbb{N}$, $X_n^\pi = \{s \in S \setminus G : P_n^{G,\pi}(s) > 0\}$. Then:
(i) for all $s \in S$, $P_n^{G,\pi}(s)$ converges to a finite value as $n$ tends to $+\infty$;
(ii) there exists $X^\pi \subseteq S$ such that $X_n^\pi \subseteq X^\pi$ for all $n \in \mathbb{N}$, and $T^\pi$ is a contraction over $X^\pi$.
This new contraction property guarantees the well-foundedness of the S³P criterion and the existence of optimal Markovian policies, without any assumption at all!

40 Evaluating and optimizing S³P policies in infinite horizon
The S³P dual criterion is well-defined for any goal-oriented MDP and any Markovian policy.
Theorem 2 (infinite horizon, general costs). Let $M$ be any general goal-oriented MDP and $\pi$ any stationary policy for $M$. The evaluation equations of Theorem 1 converge to finite values $P^{G,\pi}(s)$ and $C^{G,\pi}(s)$ for any $s \in S$ (by convention, $C_n^{G,\pi}(s) = 0$ if $P_n^{G,\pi}(s) = 0$, for all $n \in \mathbb{N}$).
There exists a Markovian policy solution to the S³P problem for any goal-oriented MDP.
Proposition (infinite horizon, general costs). Let $M$ be any general goal-oriented MDP.
1. There exists an optimal stationary policy $\pi^*$ that minimizes the infinite-horizon goal-cost function among all policies that maximize the infinite-horizon goal-probability function, i.e. $\pi^*$ is S³P-optimal.
2. The goal-probability $P^{G,\pi^*}$ and goal-cost $C^{G,\pi^*}$ functions have finite values.

41 But wait... Can we easily optimize S³Ps?
Greedy Markovian policies w.r.t. the goal-probability or goal-cost metrics need not be optimal!
[Figure: states $I$ and $s$, goal $G$; actions $a_I$, $a_1$, $a_2$, $a_G$; costs $c = 2$, $c = 1$]
$$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s'), \quad \text{with } P_0^*(s) = 0,\ \forall s \in S \setminus G;\ P_0^*(g) = 1,\ \forall g \in G$$
After 3 iterations: $a_2 \in \operatorname*{argmax}_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, \underbrace{P_2^*(s')}_{=1}$
Whereas: $P^{G,\pi=(a_2, a_2, a_2, \dots)}(s) = 0 < 1 = P^*(s)$!

43 But wait... Can we easily optimize S³Ps?
Greedy Markovian policies w.r.t. the goal-probability or goal-cost metrics need not be optimal!
[Figure: same MDP, with costs $c = 3$ and $c = 2$]
Solution: implicitly eliminate non-optimal greedy policies w.r.t. the goal-probability criterion by searching for greedy policies w.r.t. the goal-cost criterion...
...provided all costs are positive (except from the goal), so that choosing $a_2$ has an infinite cost (since $\underbrace{P^{G,\pi_{a_2}}(s)}_{=0} < \underbrace{T(s, a_2, I)\, P^*(I)}_{=1}$).

45 Optimization of S³P Markovian policies in infinite horizon
Under positive costs, greedy policies w.r.t. the goal-probability and goal-cost functions are S³P-optimal.
Theorem 3 (infinite horizon, positive costs only). Let $M$ be a goal-oriented MDP such that all transitions from non-goal states have strictly positive costs (the only required assumption).
$$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s'), \quad \text{with } P_0^*(s) = 0,\ \forall s \in S \setminus G;\ P_0^*(g) = 1,\ \forall g \in G \quad (3)$$
The functions $P_n^*$ converge to a finite-valued function $P^*$. $C_n^*(s) = 0$ if $P^*(s) = 0$; otherwise, if $P^*(s) > 0$:
$$C_n^*(s) = \min_{\substack{a \in app(s) \,:\\ \sum_{s' \in S} T(s,a,s')\, P^*(s') = P^*(s)}} \frac{1}{P^*(s)} \sum_{s' \in S} T(s, a, s')\, P^*(s') \left[c(s, a, s') + C_{n-1}^*(s')\right], \quad \text{with } C_0^*(s) = 0,\ \forall s \in S \quad (4)$$
The functions $C_n^*$ converge to a finite-valued function $C^*$, and any Markovian policy $\pi^*$ obtained from the previous equations at convergence is S³P-optimal.
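Equations (3) and (4) suggest a two-step dynamic-programming scheme, which is the idea behind GPCI: first iterate the goal-probability fixed point $P^*$, then iterate goal-costs restricted to actions greedy w.r.t. $P^*$. The sketch below is a naive fixed-iteration version on a hypothetical example (two actions with the same goal-probability but different costs), not the talk's actual GPCI implementation.

```python
# Two-step scheme of Theorem 3 (assumed naming), strictly positive costs:
# step 1 iterates P_n (Eq. 3); step 2 iterates C_n (Eq. 4) over actions
# that are greedy w.r.t. P*.
GOALS = {"G"}
states = ["I", "D", "G"]
app = {"I": ["a1", "a2"]}                 # D is a dead end: no actions listed
T = {("I", "a1"): [("G", 0.9), ("D", 0.1)],
     ("I", "a2"): [("G", 0.9), ("D", 0.1)]}
c = {("I", "a1", "G"): 5.0, ("I", "a1", "D"): 5.0,
     ("I", "a2", "G"): 1.0, ("I", "a2", "D"): 1.0}

# Step 1: goal-probability iteration (Eq. 3)
P = {s: (1.0 if s in GOALS else 0.0) for s in states}
for _ in range(100):
    for s in app:
        P[s] = max(sum(p * P[s2] for s2, p in T[(s, a)]) for a in app[s])

# Step 2: goal-cost iteration (Eq. 4) over P*-greedy actions only
C = {s: 0.0 for s in states}
for _ in range(100):
    for s in app:
        if P[s] == 0:
            continue
        greedy = [a for a in app[s]
                  if abs(sum(p * P[s2] for s2, p in T[(s, a)]) - P[s]) < 1e-12]
        C[s] = min(
            sum(p * P[s2] * (c[(s, a, s2)] + C[s2])
                for s2, p in T[(s, a)]) / P[s]
            for a in greedy
        )
# Both actions reach G with probability 0.9, so both are P*-greedy;
# step 2 then prefers the cheaper one, giving C("I") = 1.0.
```

Restricting the min in step 2 to `greedy` is the key point: it is the dynamic-programming counterpart of "minimize goal-cost among goal-probability maximizers", and it is sound here precisely because all non-goal costs are strictly positive.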

46 Summarizing the new S³P dual optimization criterion (SSP ⊂ GSSP ⊂ S³P)
- Dual optimization: total-cost minimization, averaged only over paths to the goal, among all Markovian policies maximizing the probability of reaching the goal
- Well-defined in finite or infinite horizon for any goal-oriented MDP (contrary to SSPs or GSSPs)
- But (at the moment): optimization equations in the form of dynamic programming only if all costs from non-goal states are positive
- GPCI algorithm (Goal-Probability and -Cost Iteration)
- iSSPUDE sub-model ($\hat{=}$ S³P with positive costs): efficient heuristic algorithms by Kolobov, Mausam & Weld (UAI 2012)

47 Experimental setup
Tested problems:
- Without dead ends (SSP assumptions 1 and 2 satisfied): blocksworld, rectangle-tireworld. Optimization with the standard SSP criterion, then comparison with the S³P dual criterion.
- With dead ends (SSP assumption 1 or 2 unsatisfied): exploding-blocksworld, triangle-tireworld, grid (a gridworld variation of Example III). Optimization with the DSSP criterion for many values of $\gamma$, until maximizing $P^{G,\pi}$ (resp. minimizing $C^{G,\pi}$) at best; $\gamma_{opt}$ is unknown in advance!
Tested algorithms: VI, LRTDP (optimal for (D)SSPs), RFF, GPCI (optimal for S³Ps).
Once optimized, all policies are evaluated using the S³P criterion, with systematic comparison against an optimal policy for S³Ps.

48 Analysis of the goal-probability function $P^{G,\pi}$
[Bar chart: goal probability of the GPCI, VI, LRTDP and RFF policies ($\gamma \approx 0.99$) over domains BW, RTW, TTW, EBW, G1, G2; GPCI is $P^G$-optimal]
DSSP-optimal policies (VI, LRTDP) do not maximize the probability of reaching the goal, whatever the value of $\gamma$!

49 Analysis of the goal-cost function $C^{G,\pi}$
[Bar chart: goal cost of the GPCI, VI, LRTDP and RFF policies ($\gamma \approx 0.99$) over domains BW, RTW, TTW, EBW, G1, G2; GPCI is $C^G$-optimal among $P^G$-maximizers, i.e. S³P-optimal]
DSSP-optimal policies (VI, LRTDP) do not minimize the total cost averaged only over paths to the goal, whatever $\gamma$! VI and LRTDP sometimes achieve smaller goal-costs, but with actually smaller goal-probabilities too!

50 Comparison of computation times
[Bar chart: computation time in seconds for GPCI, VI, LRTDP and RFF over domains BW, RTW, TTW, EBW, G1, G2]
- GPCI is as efficient as VI on problems that are not really probabilistically interesting ($P^{G,*} \approx 1$)
- GPCI is faster than VI, LRTDP and even RFF on problems with dead ends and a complex cost structure

51 Conclusion and perspectives
An original, well-founded dual criterion for goal-oriented MDPs (SSP ⊂ GSSP ⊂ S³P):
- S³P dual criterion: minimum goal-cost policy among the ones with maximum goal-probability
- Well-defined in infinite horizon for any goal-oriented MDP (no assumptions required, contrary to SSPs or GSSPs)
- If costs are positive: GPCI algorithm, or heuristic algorithms for the iSSPUDE sub-model [Kolobov, Mausam & Weld, 2012]
Future work:
- Unifying our general-cost model with the positive-cost model of Kolobov, Mausam & Weld
- Algorithms for solving S³Ps with general costs
- Domain-independent heuristics for estimating the goal-probability and goal-cost functions

52 Thank you for your attention!
If you care about: first reaching the goal, then minimizing costs along paths to the goal, as in deterministic/classical planning,
then be aware that: previously existing criteria do not guarantee to produce such policies,
without mentioning that: standard criteria need not even be well-defined for many interesting goal-oriented MDPs, contrary to S³Ps.
So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S³Ps.

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Probabilistic Planning. George Konidaris

Probabilistic Planning. George Konidaris Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t

More information

Multiagent Value Iteration in Markov Games

Multiagent Value Iteration in Markov Games Multiagent Value Iteration in Markov Games Amy Greenwald Brown University with Michael Littman and Martin Zinkevich Stony Brook Game Theory Festival July 21, 2005 Agenda Theorem Value iteration converges

More information

2534 Lecture 4: Sequential Decisions and Markov Decision Processes

2534 Lecture 4: Sequential Decisions and Markov Decision Processes 2534 Lecture 4: Sequential Decisions and Markov Decision Processes Briefly: preference elicitation (last week s readings) Utility Elicitation as a Classification Problem. Chajewska, U., L. Getoor, J. Norman,Y.

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

AM 121: Intro to Optimization Models and Methods: Fall 2018

AM 121: Intro to Optimization Models and Methods: Fall 2018 AM 11: Intro to Optimization Models and Methods: Fall 018 Lecture 18: Markov Decision Processes Yiling Chen Lesson Plan Markov decision processes Policies and value functions Solving: average reward, discounted

More information

On the Policy Iteration algorithm for PageRank Optimization

On the Policy Iteration algorithm for PageRank Optimization Université Catholique de Louvain École Polytechnique de Louvain Pôle d Ingénierie Mathématique (INMA) and Massachusett s Institute of Technology Laboratory for Information and Decision Systems Master s

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Markov Decision Processes and Solving Finite Problems. February 8, 2017 Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:

More information

Discrete planning (an introduction)

Discrete planning (an introduction) Sistemi Intelligenti Corso di Laurea in Informatica, A.A. 2017-2018 Università degli Studi di Milano Discrete planning (an introduction) Nicola Basilico Dipartimento di Informatica Via Comelico 39/41-20135

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Markov Decision Processes and Dynamic Programming

Markov Decision Processes and Dynamic Programming Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes

More information

Reinforcement Learning

Reinforcement Learning CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act

More information

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS

Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process

More information

Reinforcement Learning. Introduction

Reinforcement Learning. Introduction Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications

Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications May 2012 Report LIDS - 2884 Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications Dimitri P. Bertsekas Abstract We consider a class of generalized dynamic programming

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Elements of Reinforcement Learning

Elements of Reinforcement Learning Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

A Gentle Introduction to Reinforcement Learning

A Gentle Introduction to Reinforcement Learning A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Linearly-solvable Markov decision problems

Linearly-solvable Markov decision problems Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu

More information

The Role of Discount Factor in Risk Sensitive Markov Decision Processes

The Role of Discount Factor in Risk Sensitive Markov Decision Processes 06 5th Brazilian Conference on Intelligent Systems The Role of Discount Factor in Risk Sensitive Markov Decision Processes Valdinei Freire Escola de Artes, Ciências e Humanidades Universidade de São Paulo

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

The quest for finding Hamiltonian cycles

The quest for finding Hamiltonian cycles The quest for finding Hamiltonian cycles Giang Nguyen School of Mathematical Sciences University of Adelaide Travelling Salesman Problem Given a list of cities and distances between cities, what is the

More information

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018

Section Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018 Section Notes 9 Midterm 2 Review Applied Math / Engineering Sciences 121 Week of December 3, 2018 The following list of topics is an overview of the material that was covered in the lectures and sections

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Reinforcement learning

Reinforcement learning Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error

More information

Regular Policies in Abstract Dynamic Programming

Regular Policies in Abstract Dynamic Programming August 2016 (Revised January 2017) Report LIDS-P-3173 Regular Policies in Abstract Dynamic Programming Dimitri P. Bertsekas Abstract We consider challenging dynamic programming models where the associated

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

Homework 2: MDPs and Search

Homework 2: MDPs and Search Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles

More information

1 MDP Value Iteration Algorithm

1 MDP Value Iteration Algorithm CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using

More information

6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE

6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE 6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE Undiscounted problems Stochastic shortest path problems (SSP) Proper and improper policies Analysis and computational methods for SSP Pathologies of

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

1 [15 points] Search Strategies

1 [15 points] Search Strategies Probabilistic Foundations of Artificial Intelligence Final Exam Date: 29 January 2013 Time limit: 120 minutes Number of pages: 12 You can use the back of the pages if you run out of space. strictly forbidden.

More information

Motivation for introducing probabilities

Motivation for introducing probabilities for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI

Temporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Markov Decision Processes (and a small amount of reinforcement learning)

Markov Decision Processes (and a small amount of reinforcement learning) Markov Decision Processes (and a small amount of reinforcement learning) Slides adapted from: Brian Williams, MIT Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU Nicholas Roy 16.4/13 Session

More information

Abstract Dynamic Programming

Abstract Dynamic Programming Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Real Time Value Iteration and the State-Action Value Function

Real Time Value Iteration and the State-Action Value Function MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

Module 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

Module 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Module 8 Linear Programming CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Policy Optimization Value and policy iteration Iterative algorithms that implicitly solve

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

CS360 Homework 12 Solution

CS360 Homework 12 Solution CS360 Homework 12 Solution Constraint Satisfaction 1) Consider the following constraint satisfaction problem with variables x, y and z, each with domain {1, 2, 3}, and constraints C 1 and C 2, defined

More information

RECURSION EQUATION FOR

RECURSION EQUATION FOR Math 46 Lecture 8 Infinite Horizon discounted reward problem From the last lecture: The value function of policy u for the infinite horizon problem with discount factor a and initial state i is W i, u

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

The Markov Decision Process (MDP) model

The Markov Decision Process (MDP) model Decision Making in Robots and Autonomous Agents The Markov Decision Process (MDP) model Subramanian Ramamoorthy School of Informatics 25 January, 2013 In the MAB Model We were in a single casino and the

More information

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes

Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Andrew Mastin and Patrick Jaillet Abstract We analyze losses resulting from uncertain transition probabilities in Markov

More information

Central-limit approach to risk-aware Markov decision processes

Central-limit approach to risk-aware Markov decision processes Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory

More information

Infinite-Horizon Discounted Markov Decision Processes

Infinite-Horizon Discounted Markov Decision Processes Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected

More information

Occupation Measure Heuristics for Probabilistic Planning

Occupation Measure Heuristics for Probabilistic Planning Occupation Measure Heuristics for Probabilistic Planning Felipe Trevizan, Sylvie Thiébaux and Patrik Haslum Data61, CSIRO and Research School of Computer Science, ANU Canberra, ACT, Australia first.last@anu.edu.au

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

Reading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

Reading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Reading Response: Due Wednesday R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Another Example Get to the top of the hill as quickly as possible. reward = 1 for each step where

More information

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs

An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs 2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter

More information

Reinforcement learning an introduction

Reinforcement learning an introduction Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,

More information

Introduction to Reinforcement Learning Part 1: Markov Decision Processes

Introduction to Reinforcement Learning Part 1: Markov Decision Processes Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

6 Basic Convergence Results for RL Algorithms

6 Basic Convergence Results for RL Algorithms Learning in Complex Systems Spring 2011 Lecture Notes Nahum Shimkin 6 Basic Convergence Results for RL Algorithms We establish here some asymptotic convergence results for the basic RL algorithms, by showing

More information