Stochastic Safest and Shortest Path Problems
Florent Teichteil-Königsbuch
AAAI-12, Toronto, Canada, July 24-26, 2012
Path optimization under probabilistic uncertainties

These problems amount to searching for a shortest path in a probabilistic AND/OR cyclic graph:
- OR nodes: branch choice (action)
- AND nodes: probabilistic outcomes of the chosen branch (action effects)

Problem statement: compute a policy that reaches the goal with maximum probability or minimum expected cost-to-go.

Examples:
- Shortest-path planning in probabilistic grid worlds (racetrack)
- Minimum number of block moves to build towers with stochastic operators (exploding-blocksworld)
- Controller synthesis for critical systems, with maximum terminal availability and minimum energy consumption (embedded systems, transportation systems, servers, etc.)
Mathematical framework: goal-oriented MDP

A goal-oriented Markov Decision Process consists of:
- $S$: finite set of states
- $G \subseteq S$: finite set of goal states
- $A$: finite set of actions
- $T : S \times A \times S \to [0, 1]$: transition function, $T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$
- $c : S \times A \times S \to \mathbb{R}$: cost function associated with the transition function
- Absorbing goal states: $\forall g \in G, \forall a \in A, T(g, a, g) = 1$
- No costs paid from goal states: $\forall g \in G, \forall a \in A, c(g, a, g) = 0$
- $app : S \to 2^A$: applicable actions in a given state
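As an illustration of these components, here is a minimal Python encoding of a goal-oriented MDP; the tiny three-state chain and every identifier in it are hypothetical, not taken from the talk.

```python
# A minimal, illustrative encoding of a goal-oriented MDP (all names are
# hypothetical).  T[s][a] maps each successor state to its probability,
# and c[s][a][s'] gives the cost of that transition.
S = ["I", "s", "G"]          # states
GOALS = {"G"}                # goal states (G subset of S)
T = {
    "I": {"a1": {"s": 1.0}},
    "s": {"a2": {"G": 0.5, "I": 0.5}},
    "G": {"aG": {"G": 1.0}},   # goal is absorbing: T(g, a, g) = 1
}
c = {
    "I": {"a1": {"s": 2.0}},
    "s": {"a2": {"G": 1.0, "I": 1.0}},
    "G": {"aG": {"G": 0.0}},   # no cost paid from goal states
}

def app(s):
    """Applicable actions in state s."""
    return list(T[s].keys())

# Sanity checks for the goal-oriented MDP axioms listed above.
assert all(abs(sum(T[s][a].values()) - 1.0) < 1e-9 for s in S for a in app(s))
assert all(T[g][a][g] == 1.0 and c[g][a][g] == 0.0 for g in GOALS for a in app(g))
```

The nested-dictionary layout keeps transitions sparse, which matches the sparse successor sets typical of planning benchmarks such as racetrack.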
Stochastic Shortest Path (SSP) [Bertsekas & Tsitsiklis (1996)]

Optimization criterion: total (undiscounted) cost, or cost-to-go. Find a Markovian policy $\pi : S \to A$ that minimizes the expected total cost from any possible initial state:
$$\forall s \in S, \quad \pi^*(s) = \operatorname*{argmin}_{\pi \in A^S} V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{+\infty} c_t \,\Big|\, s_0 = s \right]$$
The value of $\pi$ is the solution of the Bellman equation:
$$\forall s \in S, \quad V^\pi(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s') \left( V^\pi(s') + c(s, a, s') \right)$$
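The Bellman equation above can be solved by value iteration when both SSP assumptions (introduced on the next slide) hold. The sketch below runs it on a made-up two-state MDP; the names and numbers are illustrative, not from the talk.

```python
# Value iteration for the SSP (total undiscounted cost) criterion on a toy
# goal-oriented MDP where both SSP assumptions hold (illustrative only).
T = {  # T[s][a][s'] = transition probability
    "I": {"a1": {"s": 1.0}},
    "s": {"a2": {"G": 0.5, "s": 0.5}},
}
c = {  # c[s][a][s'] = transition cost (strictly positive off the goal)
    "I": {"a1": {"s": 2.0}},
    "s": {"a2": {"G": 1.0, "s": 1.0}},
}
GOALS = {"G"}

def value_iteration(eps=1e-10):
    V = {s: 0.0 for s in T}
    V.update({g: 0.0 for g in GOALS})   # V is 0 at absorbing goal states
    while True:
        delta = 0.0
        for s in T:  # in-place (Gauss-Seidel) Bellman backups
            q = min(sum(p * (c[s][a][s2] + V[s2]) for s2, p in T[s][a].items())
                    for a in T[s])
            delta = max(delta, abs(q - V[s]))
            V[s] = q
        if delta < eps:
            return V

V = value_iteration()
# From "s", the expected cost solves V(s) = 1 + 0.5 V(s), hence V(s) = 2,
# and V(I) = 2 + V(s) = 4.
print(V["s"], V["I"])
```

Since the self-loop at "s" has probability 0.5, the backup is a contraction there and the iteration converges geometrically.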
SSP: required theoretical and practical assumptions

Assumption 1: there exists at least one proper policy, i.e. a policy that reaches the goal with probability 1 regardless of the initial state.

Assumption 2: for every improper policy, the corresponding cost-to-go is infinite, i.e. all cycles not leading to the goal are composed of strictly positive costs.

Implications if both Assumptions 1 and 2 hold:
- there exists a policy $\pi$ such that $V^\pi$ is finite;
- an optimal Markovian (stationary) policy can be obtained using dynamic programming (Bellman equation).
Drawbacks of the SSP criterion

SSP assumptions are not easy to check in practice:
- deciding whether Assumptions 1 and 2 hold is not obvious in general;
- it is in the same complexity class as optimizing the cost-to-go criterion.

Limited practical scope:
- limited to optimizing policies that reach the goal with probability 1, without nonpositive-cost cycles not leading to the goal;
- especially problematic in the presence of dead ends or nonpositive-cost cycles;
- in the absence of proper policies, there is no known method to optimize both the probability of reaching the goal and the corresponding total cost of the paths to the goal (dual optimization).
Alternatives to the standard SSP criterion

Generalized SSP (GSSP) [Kolobov, Mausam & Weld (2011)]
- Constrained optimization: find the minimal-cost policy among those reaching the goal with probability 1
- Includes many other problems, e.g. finding the policy that reaches the goal with maximum probability
- Drawback: assumes proper policies; cannot find minimal-cost policies among those reaching the goal with maximum probability

Discounted SSP (DSSP) [Teichteil, Vidal & Infantes (2011)]
- Relaxed heuristics for the $\gamma$-discounted criterion:
$$\forall s \in S, \quad V^\pi(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s') \left( \gamma V^\pi(s') + c(s, a, s') \right)$$
- Drawback: accumulates costs along paths not reaching the goal, so the optimized policy may actually avoid the goal...

fSSPUDE / iSSPUDE [Kolobov, Mausam & Weld (2012)]
- fSSPUDE: goal MDPs with finite-cost unavoidable dead ends; no dual optimization of goal-probability and goal-cost
- iSSPUDE: goal MDPs with infinite-cost unavoidable dead ends; (required) dual optimization, limited to positive costs
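The DSSP drawback can be made concrete with a tiny hypothetical example (the numbers below are made up, not the slide's): when a cheap action leads to a zero-cost absorbing dead end, the discounted criterion prefers it over a costlier action with a much higher goal probability.

```python
# DSSP drawback demo, illustrative numbers only.  From state I, "risky"
# costs 1 and reaches the goal w.p. 0.55; "safe" costs 10 and reaches it
# w.p. 0.95.  Goal and dead end are both absorbing with zero cost, so
# V(G) = V(d) = 0 and the discounted value of acting once from I is just
# the immediate cost of the chosen action.
gamma = 0.99
actions = {"risky": {"cost": 1.0,  "p_goal": 0.55},
           "safe":  {"cost": 10.0, "p_goal": 0.95}}
V = {a: spec["cost"] + gamma * (spec["p_goal"] * 0.0
                                + (1.0 - spec["p_goal"]) * 0.0)
     for a, spec in actions.items()}
dssp_choice = min(V, key=V.get)                                # minimizes discounted cost
s3p_choice = max(actions, key=lambda a: actions[a]["p_goal"])  # maximizes goal-probability
print(dssp_choice, s3p_choice)   # risky safe
```

DSSP picks "risky" (value 1 vs 10) even though it reaches the goal only 55% of the time, which is exactly the "may avoid the goal" pathology.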
S³P: a new dual optimization criterion for goal MDPs

- Goal-probability function $P_n^{G,\pi}(s)$: probability of reaching the goal in at most $n$ steps-to-go by executing $\pi$ from $s$
- Goal-cost function $C_n^{G,\pi}(s)$: expected total cost in at most $n$ steps-to-go by executing $\pi$ from $s$, averaged only over paths to the goal

S³P optimization criterion (SSP ⊂ GSSP ⊂ S³P): find an optimal (Markovian) policy $\pi^*$ that minimizes $C^{G,\pi}$ among all policies maximizing $P^{G,\pi}$:
$$\pi^*(s) \in \operatorname*{argmin}_{\pi :\ \forall s' \in S,\ \pi(s') \in \operatorname*{argmax}_{\pi' \in A^S} P^{G,\pi'}(s')} C^{G,\pi}(s)$$
Example I: goal-oriented MDP with a proper policy

[Diagram: states I, s, G; actions $a_I$, $a_1$, $a_2$, $a_G$; transition costs c = 2 and c = 1]

2 policies: $\pi_1 = (a_I, a_1, a_G, a_G, a_G, \ldots)$, $\pi_2 = (a_I, a_2, a_I, a_2, a_I, a_2, \ldots)$

From state I:
- $V^\pi(I)$ [SSP]: $\pi_1$: 2; $\pi_2$: not well-defined (Assumption 2 is not satisfied, so the SSP criterion is not well-defined!)
- $P^{G,\pi}(I)$ [S³P]: $\pi_1$: 1; $\pi_2$: 0
- $C^{G,\pi}(I)$ [S³P]: $\pi_1$: 2; $\pi_2$: 0

$\pi_1$ is the optimal S³P policy.
Example II: goal-oriented MDP without a proper policy

[Diagram: states I, s, G, d; actions $a_I$, $a_1$, $a_2$, $a_3$, $a_G$, $a_d$; transition costs c = 2]

4 Markovian policies, depending on the action chosen in I: $a_1 \to \pi_1$, $a_2 \to \pi_2$, $a_3 \to \pi_3$, $a_I \to \pi_4$.

Optimal SSP policy: $\pi_3$ (SSP well-defined, but its assumptions are unsatisfied!)

[Table: values of $V$, $P^{G,\pi}$ and $C^{G,\pi}$ from state I for $\pi_1$ to $\pi_4$]

Optimal S³P policy: $\pi_1$
Example III: goal-oriented MDP without a proper policy

[Diagram: states I, s₁, s₂, s₃, dead ends d₁, d₂, d₃, goal G; actions $a_I$, $a_1$, $a_2$, $a_3$, $a_G$; transition costs c = 1, c = 10 and c = 100 on the three branches]

4 Markovian policies, according to the action chosen in I: $a_1 \to \pi_1$, $a_2 \to \pi_2$, $a_3 \to \pi_3$, $a_I \to \pi_4$.

Assumptions 1 and 2 are both unsatisfied: $\forall i \in \{1, 2, 3, 4\}, V^{\pi_i}(I) = +\infty$

$\gamma$-discounted criterion: $\forall\, 0 < \gamma < 1, V_\gamma^{\pi_4}(I) > V_\gamma^{\pi_3}(I) > V_\gamma^{\pi_2}(I) > V_\gamma^{\pi_1}(I)$, so $\pi_1$ is the only optimal DSSP policy for all $0 < \gamma < 1$.

S³P criterion: $\pi_2$ is the only optimal S³P policy! Indeed $P^{G,\pi_1}(I) = 0.55$, $P^{G,\pi_2}(I) = 0.95$, $P^{G,\pi_3}(I) = 0.95$, $P^{G,\pi_4}(I) = 0$, and $C^{G,\pi_4}(I) = 0$ [the goal-cost values of $\pi_1$, $\pi_2$, $\pi_3$ are cut off in the transcription].
Lessons learnt from these examples

If you care about: first reaching the goal, then minimizing costs along paths to the goal, like in deterministic/classical planning!

Then be aware that: previously existing criteria do not guarantee to produce such policies.

Without mentioning that: standard criteria need not even be well-defined for many interesting goal-oriented MDPs, contrary to S³Ps.

So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S³Ps.
Policy evaluation for the S³P dual criterion

Theorem 1 (finite horizon $H \in \mathbb{N}$, general costs). For all steps-to-go $1 \le n \le H$, all history-dependent policies $\pi = (\pi_0, \ldots, \pi_{H-1})$ and all states $s \in S$:
$$P_n^{G,\pi}(s) = \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s'), \quad \text{with } P_0^{G,\pi}(s) = 0\ \forall s \in S \setminus G \text{ and } P_0^{G,\pi}(g) = 1\ \forall g \in G \quad (1)$$
If $P_n^{G,\pi}(s) > 0$, then $C_n^{G,\pi}(s)$ is well-defined and satisfies:
$$C_n^{G,\pi}(s) = \frac{1}{P_n^{G,\pi}(s)} \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s') \left[ c(s, \pi_{H-n}(s), s') + C_{n-1}^{G,\pi}(s') \right], \quad \text{with } C_0^{G,\pi}(s) = 0\ \forall s \in S \quad (2)$$
The factor $1 / P_n^{G,\pi}(s)$ is a normalization factor for averaging only over paths to the goal.
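Recursions (1) and (2) can be transcribed directly for a stationary policy. The sketch below evaluates a fixed policy on a hypothetical two-state example with one dead end; the names and numbers are illustrative, not from the talk.

```python
# Goal-probability P_n and goal-cost C_n of a fixed stationary policy,
# following Theorem 1.  Toy MDP, illustrative only: from "s" the policy
# reaches the goal w.p. 0.5 and an absorbing dead end "d" w.p. 0.5.
T = {"s": {"G": 0.5, "d": 0.5}, "d": {"d": 1.0}}   # T[s][s'] under the policy
c = {"s": {"G": 3.0, "d": 3.0}, "d": {"d": 1.0}}
GOALS = {"G"}

def evaluate(n_steps):
    P = {s: 0.0 for s in T}; P.update({g: 1.0 for g in GOALS})
    C = {s: 0.0 for s in T}; C.update({g: 0.0 for g in GOALS})
    for _ in range(n_steps):
        newP, newC = dict(P), dict(C)
        for s in T:
            newP[s] = sum(p * P[s2] for s2, p in T[s].items())       # eq. (1)
            if newP[s] > 0:   # eq. (2): average only over paths to the goal
                newC[s] = sum(p * P[s2] * (c[s][s2] + C[s2])
                              for s2, p in T[s].items()) / newP[s]
            else:
                newC[s] = 0.0  # convention when the goal is unreachable
        for g in GOALS:        # goal states keep P = 1 and C = 0
            newP[g], newC[g] = 1.0, 0.0
        P, C = newP, newC
    return P, C

P, C = evaluate(50)
print(P["s"], C["s"])   # 0.5 3.0
```

Note that the dead-end branch contributes nothing to $C$: the cost is averaged only over goal-reaching paths, which is precisely what distinguishes the goal-cost from the usual expected total cost.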
Convergence of the S³P dual criterion in infinite horizon

A fundamental difference with SSPs: demonstrate and use the contraction property of the transition function over the states reaching the goal with some positive probability (for any Markovian policy, in infinite horizon, with general costs).

Lemma 1 (infinite horizon, general costs), which generalizes Bertsekas & Tsitsiklis' theoretical SSP results to S³Ps. Let M be a general goal-oriented MDP, $\pi$ a stationary policy, $T^\pi$ the transition matrix for policy $\pi$, and, for all $n \in \mathbb{N}$, $X_n^\pi = \{s \in S \setminus G : P_n^{G,\pi}(s) > 0\}$. Then:
(i) for all $s \in S$, $P_n^{G,\pi}(s)$ converges to a finite value as $n$ tends to $+\infty$;
(ii) there exists $X^\pi \subseteq S$ such that $X_n^\pi \subseteq X^\pi$ for all $n \in \mathbb{N}$, and $T^\pi$ is a contraction over $X^\pi$.

This new contraction property guarantees the well-foundedness of the S³P criterion and the existence of optimal Markovian policies, without any assumption at all!
Evaluating and optimizing S³P policies in infinite horizon

The S³P dual criterion is well-defined for any goal-oriented MDP and any Markovian policy.

Theorem 2 (infinite horizon, general costs). Let M be any general goal-oriented MDP and $\pi$ any stationary policy for M. The evaluation equations of Theorem 1 converge to finite values $P^{G,\pi}(s)$ and $C^{G,\pi}(s)$ for any $s \in S$ (by convention, $C_n^{G,\pi}(s) = 0$ if $P_n^{G,\pi}(s) = 0$, for all $n \in \mathbb{N}$).

There exists a Markovian policy solving the S³P problem for any goal-oriented MDP.

Proposition (infinite horizon, general costs). Let M be any general goal-oriented MDP.
1. There exists an optimal stationary policy $\pi^*$ that minimizes the infinite-horizon goal-cost function among all policies that maximize the infinite-horizon goal-probability function, i.e. $\pi^*$ is S³P-optimal.
2. The goal-probability $P^{G,\pi^*}$ and goal-cost $C^{G,\pi^*}$ functions have finite values.
But wait... can we easily optimize S³Ps?

Greedy Markovian policies w.r.t. the goal-probability or goal-cost metrics need not be optimal!

[Diagram: Example I again, with states I, s, G and actions $a_I$, $a_1$, $a_2$, $a_G$]

$$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s'), \quad \text{with } P_0^*(s) = 0\ \forall s \in S \setminus G;\ P_0^*(g) = 1\ \forall g \in G$$

After 3 iterations: $a_2 \in \operatorname*{argmax}_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_2^*(s') = 1$, whereas $P^{G,\pi = (a_2, a_2, a_2, \ldots)}(s) = 0 < 1 = P^*(s)$!

Solution: implicitly eliminate the non-optimal greedy policies w.r.t. the goal-probability criterion by searching for greedy policies w.r.t. the goal-cost criterion... provided all costs are positive (except from the goal), so that choosing $a_2$ has an infinite cost (since $P^{G,\pi_{a_2}}(s) = 0 < 1 = T(s, a_2, I)\, P^*(I)$).
Optimization of S³P Markovian policies in infinite horizon

Under positive costs, greedy policies w.r.t. the goal-probability and goal-cost functions are S³P-optimal.

Theorem 3 (infinite horizon, positive costs only). Let M be a goal-oriented MDP such that all transitions from non-goal states have strictly positive costs.
$$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s'), \quad \text{with } P_0^*(s) = 0\ \forall s \in S \setminus G;\ P_0^*(g) = 1\ \forall g \in G \quad (3)$$
The functions $P_n^*$ converge to a finite-valued function $P^*$. Then $C_n^*(s) = 0$ if $P^*(s) = 0$; otherwise, if $P^*(s) > 0$:
$$C_n^*(s) = \min_{\substack{a \in app(s) :\\ \sum_{s' \in S} T(s, a, s') P^*(s') = P^*(s)}} \frac{1}{P^*(s)} \sum_{s' \in S} T(s, a, s')\, P^*(s') \left[ c(s, a, s') + C_{n-1}^*(s') \right], \quad \text{with } C_0^*(s) = 0\ \forall s \in S \quad (4)$$
The functions $C_n^*$ converge to a finite-valued function $C^*$, and any Markovian policy $\pi^*$ obtained from the previous equations at convergence is S³P-optimal. Note that this construction requires only positive costs.
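Under Theorem 3's positive-cost assumption, equations (3) and (4) suggest a two-phase dynamic program in the spirit of GPCI: iterate $P^*$ to a fixed point, then iterate $C^*$ restricted to the goal-probability-maximizing actions. The sketch below is a minimal illustration on a made-up two-action MDP, not the paper's implementation.

```python
# Two-phase dynamic program in the spirit of GPCI (Theorem 3: strictly
# positive costs from non-goal states).  Illustrative toy MDP: from I,
# "risky" reaches G w.p. 0.6 for cost 1, "safe" w.p. 0.9 for cost 10.
T = {"I": {"risky": {"G": 0.6, "d": 0.4}, "safe": {"G": 0.9, "d": 0.1}},
     "d": {"stay": {"d": 1.0}}}
c = {"I": {"risky": {"G": 1.0, "d": 1.0}, "safe": {"G": 10.0, "d": 10.0}},
     "d": {"stay": {"d": 1.0}}}
GOALS = {"G"}
EPS = 1e-9

def gpci(n_iter=200):
    # Phase 1: fixed point of eq. (3) -- maximum goal-probability P*.
    P = {s: 0.0 for s in T}; P.update({g: 1.0 for g in GOALS})
    for _ in range(n_iter):
        for s in T:
            P[s] = max(sum(p * P[s2] for s2, p in T[s][a].items())
                       for a in T[s])
    # Phase 2: fixed point of eq. (4) -- minimize the goal-cost over the
    # actions that achieve P* in each state.
    C = {s: 0.0 for s in T}; C.update({g: 0.0 for g in GOALS})
    pi = {}
    for _ in range(n_iter):
        for s in T:
            if P[s] <= EPS:
                continue   # convention: C = 0 when the goal is unreachable
            best = [a for a in T[s]
                    if abs(sum(p * P[s2] for s2, p in T[s][a].items()) - P[s]) < EPS]
            qs = {a: sum(p * P[s2] * (c[s][a][s2] + C[s2])
                         for s2, p in T[s][a].items()) / P[s] for a in best}
            pi[s] = min(qs, key=qs.get)
            C[s] = qs[pi[s]]
    return P, C, pi

P, C, pi = gpci()
print(P["I"], pi["I"])   # 0.9 safe
```

Phase 2 never even considers "risky" in state I, because its one-step goal-probability (0.6) falls short of $P^*(I) = 0.9$: goal-probability dominates, and cost only breaks ties, exactly as the S³P criterion demands.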
Summarizing the new S³P dual optimization criterion

SSP ⊂ GSSP ⊂ S³P

- Dual optimization: total-cost minimization, averaged only over paths to the goal, among all Markovian policies maximizing the probability of reaching the goal
- Well-defined in finite or infinite horizon for any goal-oriented MDP (contrary to SSPs or GSSPs)
- But (at the moment): optimization equations in the form of dynamic programming only if all costs from non-goal states are positive, giving the GPCI algorithm (Goal-Probability and -Cost Iteration)
- iSSPUDE sub-model (equivalent to S³P with positive costs): efficient heuristic algorithms by Kolobov, Mausam & Weld (UAI 2012)
Experimental setup

Tested problems:
- Without dead ends (Assumptions 1 and 2 of SSPs satisfied): blocksworld, rectangle-tireworld. Optimization with the standard SSP criterion, then comparison with the S³P dual criterion.
- With dead ends (Assumption 1 or 2 of SSPs unsatisfied): exploding-blocksworld, triangle-tireworld, grid (a gridworld variation on Example III). Optimization with the DSSP criterion for many values of $\gamma$, until $P^{G,\pi}$ is maximized (resp. $C^{G,\pi}$ minimized) at best; $\gamma_{opt}$ is unknown in advance!

Tested algorithms: VI and LRTDP (optimal for (D)SSPs), RFF, and GPCI (optimal for S³Ps).

Once optimized, all policies are evaluated using the S³P criterion and systematically compared with an optimal policy for S³Ps.
Analysis of the goal-probability function $P^{G,\pi}$

[Plot: goal probability per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF, with $\gamma$ up to 0.99; GPCI is $P^G$-optimal]

DSSP-optimal policies (VI, LRTDP) do not maximize the probability of reaching the goal, whatever the value of $\gamma$!
Analysis of the goal-cost function $C^{G,\pi}$

[Plot: goal cost per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF, with $\gamma$ up to 0.99; GPCI is $C^G$-optimal among $P^G$-optimal policies, i.e. S³P-optimal]

DSSP-optimal policies (VI, LRTDP) do not minimize the total cost averaged only over paths to the goal, whatever $\gamma$! VI and LRTDP achieve smaller goal-costs, but also actually smaller goal-probabilities!
Comparison of computation times

[Plot: computation time in seconds per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF]

GPCI is as efficient as VI on problems that are not really probabilistically interesting ($P^{G,*} \approx 1$), and faster than VI, LRTDP and even RFF on problems with dead ends and a complex cost structure.
Conclusion and perspectives

An original, well-founded dual criterion for goal-oriented MDPs: SSP ⊂ GSSP ⊂ S³P
- S³P dual criterion: minimum goal-cost policy among those with maximum goal-probability
- The S³P dual criterion is well-defined in infinite horizon for any goal-oriented MDP (no assumptions required, contrary to SSPs or GSSPs)
- If costs are positive: GPCI algorithm, or heuristic algorithms for the iSSPUDE sub-model [Kolobov, Mausam & Weld, 2012]

Future work:
- Unifying our general-cost model and the positive-cost model of Kolobov, Mausam & Weld
- Algorithms for solving S³Ps with general costs
- Domain-independent heuristics for estimating the goal-probability and goal-cost functions
Thank you for your attention!
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationAM 121: Intro to Optimization Models and Methods: Fall 2018
AM 11: Intro to Optimization Models and Methods: Fall 018 Lecture 18: Markov Decision Processes Yiling Chen Lesson Plan Markov decision processes Policies and value functions Solving: average reward, discounted
More informationOn the Policy Iteration algorithm for PageRank Optimization
Université Catholique de Louvain École Polytechnique de Louvain Pôle d Ingénierie Mathématique (INMA) and Massachusett s Institute of Technology Laboratory for Information and Decision Systems Master s
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationDiscrete planning (an introduction)
Sistemi Intelligenti Corso di Laurea in Informatica, A.A. 2017-2018 Università degli Studi di Milano Discrete planning (an introduction) Nicola Basilico Dipartimento di Informatica Via Comelico 39/41-20135
More informationPrioritized Sweeping Converges to the Optimal Value Function
Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science
More informationMarkov Decision Processes and Dynamic Programming
Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) Ecole Centrale - Option DAD SequeL INRIA Lille EC-RL Course In This Lecture A. LAZARIC Markov Decision Processes
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationChapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS
Chapter 2 SOME ANALYTICAL TOOLS USED IN THE THESIS 63 2.1 Introduction In this chapter we describe the analytical tools used in this thesis. They are Markov Decision Processes(MDP), Markov Renewal process
More informationReinforcement Learning. Introduction
Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control
More informationPartially Observable Markov Decision Processes (POMDPs)
Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationWeighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications
May 2012 Report LIDS - 2884 Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications Dimitri P. Bertsekas Abstract We consider a class of generalized dynamic programming
More information16.4 Multiattribute Utility Functions
285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate
More informationElements of Reinforcement Learning
Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationA Gentle Introduction to Reinforcement Learning
A Gentle Introduction to Reinforcement Learning Alexander Jung 2018 1 Introduction and Motivation Consider the cleaning robot Rumba which has to clean the office room B329. In order to keep things simple,
More information1 Stochastic Dynamic Programming
1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future
More informationLinearly-solvable Markov decision problems
Advances in Neural Information Processing Systems 2 Linearly-solvable Markov decision problems Emanuel Todorov Department of Cognitive Science University of California San Diego todorov@cogsci.ucsd.edu
More informationThe Role of Discount Factor in Risk Sensitive Markov Decision Processes
06 5th Brazilian Conference on Intelligent Systems The Role of Discount Factor in Risk Sensitive Markov Decision Processes Valdinei Freire Escola de Artes, Ciências e Humanidades Universidade de São Paulo
More informationA Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time
A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationArtificial Intelligence
Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationThe Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount
The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational
More informationThe quest for finding Hamiltonian cycles
The quest for finding Hamiltonian cycles Giang Nguyen School of Mathematical Sciences University of Adelaide Travelling Salesman Problem Given a list of cities and distances between cities, what is the
More informationSection Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018
Section Notes 9 Midterm 2 Review Applied Math / Engineering Sciences 121 Week of December 3, 2018 The following list of topics is an overview of the material that was covered in the lectures and sections
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationFinal Exam December 12, 2017
Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes
More informationReinforcement learning
Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error
More informationRegular Policies in Abstract Dynamic Programming
August 2016 (Revised January 2017) Report LIDS-P-3173 Regular Policies in Abstract Dynamic Programming Dimitri P. Bertsekas Abstract We consider challenging dynamic programming models where the associated
More informationControl Theory : Course Summary
Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields
More informationReinforcement Learning
Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value
More informationQ-Learning for Markov Decision Processes*
McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of
More informationHomework 2: MDPs and Search
Graduate Artificial Intelligence 15-780 Homework 2: MDPs and Search Out on February 15 Due on February 29 Problem 1: MDPs [Felipe, 20pts] Figure 1: MDP for Problem 1. States are represented by circles
More information1 MDP Value Iteration Algorithm
CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using
More information6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE
6.231 DYNAMIC PROGRAMMING LECTURE 17 LECTURE OUTLINE Undiscounted problems Stochastic shortest path problems (SSP) Proper and improper policies Analysis and computational methods for SSP Pathologies of
More informationLecture 3: The Reinforcement Learning Problem
Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More information1 [15 points] Search Strategies
Probabilistic Foundations of Artificial Intelligence Final Exam Date: 29 January 2013 Time limit: 120 minutes Number of pages: 12 You can use the back of the pages if you run out of space. strictly forbidden.
More informationMotivation for introducing probabilities
for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.
More informationToday s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes
Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationFinal Exam December 12, 2017
Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes
More informationMarkov Decision Processes (and a small amount of reinforcement learning)
Markov Decision Processes (and a small amount of reinforcement learning) Slides adapted from: Brian Williams, MIT Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU Nicholas Roy 16.4/13 Session
More informationAbstract Dynamic Programming
Abstract Dynamic Programming Dimitri P. Bertsekas Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Overview of the Research Monograph Abstract Dynamic Programming"
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationReal Time Value Iteration and the State-Action Value Function
MS&E338 Reinforcement Learning Lecture 3-4/9/18 Real Time Value Iteration and the State-Action Value Function Lecturer: Ben Van Roy Scribe: Apoorva Sharma and Tong Mu 1 Review Last time we left off discussing
More informationApproximate Dynamic Programming
Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.
More informationModule 8 Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo
Module 8 Linear Programming CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Policy Optimization Value and policy iteration Iterative algorithms that implicitly solve
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationCS360 Homework 12 Solution
CS360 Homework 12 Solution Constraint Satisfaction 1) Consider the following constraint satisfaction problem with variables x, y and z, each with domain {1, 2, 3}, and constraints C 1 and C 2, defined
More informationRECURSION EQUATION FOR
Math 46 Lecture 8 Infinite Horizon discounted reward problem From the last lecture: The value function of policy u for the infinite horizon problem with discount factor a and initial state i is W i, u
More informationChapter 16 Planning Based on Markov Decision Processes
Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More informationThe Markov Decision Process (MDP) model
Decision Making in Robots and Autonomous Agents The Markov Decision Process (MDP) model Subramanian Ramamoorthy School of Informatics 25 January, 2013 In the MAB Model We were in a single casino and the
More informationLoss Bounds for Uncertain Transition Probabilities in Markov Decision Processes
Loss Bounds for Uncertain Transition Probabilities in Markov Decision Processes Andrew Mastin and Patrick Jaillet Abstract We analyze losses resulting from uncertain transition probabilities in Markov
More informationCentral-limit approach to risk-aware Markov decision processes
Central-limit approach to risk-aware Markov decision processes Jia Yuan Yu Concordia University November 27, 2015 Joint work with Pengqian Yu and Huan Xu. Inventory Management 1 1 Look at current inventory
More informationInfinite-Horizon Discounted Markov Decision Processes
Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected
More informationOccupation Measure Heuristics for Probabilistic Planning
Occupation Measure Heuristics for Probabilistic Planning Felipe Trevizan, Sylvie Thiébaux and Patrik Haslum Data61, CSIRO and Research School of Computer Science, ANU Canberra, ACT, Australia first.last@anu.edu.au
More informationCS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability
CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)
More informationReading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
Reading Response: Due Wednesday R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Another Example Get to the top of the hill as quickly as possible. reward = 1 for each step where
More informationAn Empirical Algorithm for Relative Value Iteration for Average-cost MDPs
2015 IEEE 54th Annual Conference on Decision and Control CDC December 15-18, 2015. Osaka, Japan An Empirical Algorithm for Relative Value Iteration for Average-cost MDPs Abhishek Gupta Rahul Jain Peter
More informationReinforcement learning an introduction
Reinforcement learning an introduction Prof. Dr. Ann Nowé Computational Modeling Group AIlab ai.vub.ac.be November 2013 Reinforcement Learning What is it? Learning from interaction Learning about, from,
More informationIntroduction to Reinforcement Learning Part 1: Markov Decision Processes
Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More information6 Basic Convergence Results for RL Algorithms
Learning in Complex Systems Spring 2011 Lecture Notes Nahum Shimkin 6 Basic Convergence Results for RL Algorithms We establish here some asymptotic convergence results for the basic RL algorithms, by showing
More information