Stochastic Safest and Shortest Path Problems

Stochastic Safest and Shortest Path Problems
Florent Teichteil-Königsbuch
AAAI-12, Toronto, Canada, July 24-26, 2012

Path optimization under probabilistic uncertainties
Problems that amount to searching for a shortest path in a probabilistic AND/OR cyclic graph:
- OR nodes: branch choice (action)
- AND nodes: probabilistic outcomes of chosen branches (action effects)
Problem statement: compute a policy that reaches the goal with maximum probability or minimum expected cost-to-go.
Examples:
- Shortest-path planning in probabilistic grid worlds (racetrack)
- Minimum number of block moves to build towers with stochastic operators (exploding-blocksworld)
- Controller synthesis for critical systems, with maximum terminal availability and minimum energy consumption (embedded systems, transportation systems, servers, etc.)

Mathematical framework: goal-oriented MDP
A goal-oriented Markov Decision Process consists of:
- $S$: finite set of states
- $G \subseteq S$: finite set of goal states
- $A$: finite set of actions
- $T : S \times A \times S \to [0;1]$, $T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$: transition function
- $c : S \times A \times S \to \mathbb{R}$: cost function associated with the transition function
- Absorbing goal states: $\forall g \in G, \forall a \in A,\ T(g, a, g) = 1$
- No costs paid from goal states: $\forall g \in G, \forall a \in A,\ c(g, a, g) = 0$
- $app : S \to 2^A$: applicable actions in a given state

Stochastic Shortest Path (SSP) [Bertsekas & Tsitsiklis (1996)]
Optimization criterion: total (undiscounted) cost, or cost-to-go.
Find a Markovian policy $\pi : S \to A$ that minimizes the expected total cost from any possible initial state:
$\forall s \in S,\quad \pi^*(s) = \operatorname{argmin}_{\pi \in A^S} V^{\pi}(s), \quad \text{where } V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} c_t \,\middle|\, s_0 = s\right]$
The value of $\pi^*$ is the solution of the Bellman equation:
$\forall s \in S,\quad V^{\pi^*}(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\left( V^{\pi^*}(s') + c(s, a, s') \right)$
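To make the cost-to-go recursion concrete, here is a minimal value-iteration sketch in Python for the Bellman equation above, run on a toy tabular MDP. The dict-based encoding (states, goals, app, T, c) and all identifiers are illustrative assumptions introduced for this sketch, not notation or code from the talk.

```python
# Minimal value-iteration sketch for the SSP Bellman optimality equation.
# The MDP encoding (states, goals, app, T, c) is a toy assumption for illustration.

def ssp_value_iteration(states, goals, app, T, c, eps=1e-8, max_iter=10000):
    """Iterate V(s) <- min_a sum_s' T(s,a,s') * (c(s,a,s') + V(s'))."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s in goals:
                continue  # absorbing, zero-cost goal states keep value 0
            best = min(
                sum(p * (c[(s, a, s2)] + V[s2]) for s2, p in T[(s, a)].items())
                for a in app[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    return V

# Toy example: from 'I', action 'go' reaches goal 'G' w.p. 0.5 (cost 1)
# or stays in 'I' w.p. 0.5 (cost 1); the expected cost-to-go is 2.
states, goals = {"I", "G"}, {"G"}
app = {"I": ["go"]}
T = {("I", "go"): {"G": 0.5, "I": 0.5}}
c = {("I", "go", "G"): 1.0, ("I", "go", "I"): 1.0}
print(ssp_value_iteration(states, goals, app, T, c)["I"])  # ~2.0
```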

SSP: required theoretical and practical assumptions
Assumption 1: there exists at least one proper policy, i.e. a policy that reaches the goal with probability 1 regardless of the initial state.
Assumption 2: for every improper policy, the corresponding cost-to-go is infinite, i.e. all cycles not leading to the goal are composed of strictly positive costs.
Implications if both assumptions 1 and 2 hold:
- there exists a policy $\pi^*$ such that $V^{\pi^*}$ is finite;
- an optimal Markovian (stationary) policy can be obtained using dynamic programming (Bellman equation).

Drawbacks of the SSP criterion
SSP assumptions are not easy to check in practice:
- deciding whether assumptions 1 and 2 hold is not obvious in general;
- it is in the same complexity class as optimizing the cost-to-go criterion.
Limited practical scope:
- limited to optimizing policies reaching the goal with probability 1, without nonpositive-cost cycles not leading to the goal;
- especially annoying in presence of dead-ends or nonpositive-cost cycles;
- in the absence of proper policies, no known method optimizes both the probability of reaching the goal and the corresponding total cost of the paths to the goal (dual optimization).

Alternatives to the standard SSP criterion
Generalized SSP (GSSP) [Kolobov, Mausam & Weld (2011)]
- Constrained optimization: find the minimal-cost policy among the ones reaching the goal with probability 1.
- Includes many other problems, e.g. find the policy reaching the goal with maximum probability.
- Drawback: assumes proper policies; cannot find minimal-cost policies among the ones reaching the goal with maximum probability.
Discounted SSP (DSSP) [Teichteil, Vidal & Infantes (2011)]
- Relaxed heuristics for the $\gamma$-discounted criterion:
  $\forall s \in S,\quad V^{\pi}(s) = \min_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\left( \gamma V^{\pi}(s') + c(s, a, s') \right)$
- Drawback: accumulates costs along paths not reaching the goal, so the optimized policy may actually avoid the goal...
fsspude / isspude [Kolobov, Mausam & Weld (2012)]
- fsspude: goal MDPs with finite-cost unavoidable dead-ends; no dual optimization of goal-probability and goal-cost.
- isspude: goal MDPs with infinite-cost unavoidable dead-ends (required); dual optimization, limited to positive costs.

S³P: a new dual optimization criterion for goal MDPs (SSP ⊆ GSSP ⊆ S³P)
Goal-probability function $P_n^{G,\pi}(s)$: probability of reaching the goal in at most $n$ steps-to-go by executing $\pi$ from $s$.
Goal-cost function $C_n^{G,\pi}(s)$: expected total cost in at most $n$ steps-to-go by executing $\pi$ from $s$, averaged only over paths to the goal.
S³P optimization criterion: find an optimal (Markovian) policy $\pi^*$ that minimizes $C^{G,\pi}$ among all policies maximizing $P^{G,\pi}$:
$\pi^*(s) \in \operatorname{argmin}_{\pi:\ \forall s' \in S,\ \pi(s') \in \operatorname{argmax}_{\pi' \in A^S} P^{G,\pi'}(s')} C^{G,\pi}(s)$

Example I: goal-oriented MDP with a proper policy
[Figure: three-state MDP with states I, s, G; actions a_I, a_1, a_2, a_G; costs c = 2 and c = 1]
2 policies: $\pi_1 = (a_I, a_1, a_G, a_G, a_G, \dots)$, $\pi_2 = (a_I, a_2, a_I, a_2, a_I, a_2, \dots)$
                          $\pi_1$   $\pi_2$
$V^{\pi}(I)$ [SSP]          2        —
$P^{G,\pi}(I)$ [S³P]        1        0
$C^{G,\pi}(I)$ [S³P]        2        0
Optimal S³P policy: $\pi_1$.
Assumption 2 is not satisfied, so the SSP criterion is not well-defined!

Example II: goal-oriented MDP without a proper policy
[Figure: MDP with initial state I, intermediate state s, goal G and dead-end d; actions a_I, a_1, a_2, a_3, a_G, a_d; costs c = 2]
4 Markovian policies, depending on the action chosen in I: $a_1 \to \pi_1$, $a_2 \to \pi_2$, $a_3 \to \pi_3$, $a_I \to \pi_4$.
Optimal SSP policy: $\pi_3$ (SSP is well-defined here, but its assumptions are not satisfied!)
From state I:
            V          P       C
$\pi_1$    1.1        0.95    1.05
$\pi_2$    2.1        0.95    1.85
$\pi_3$    1.9        0.05    1
$\pi_4$    $+\infty$  0       0
Optimal S³P policy: $\pi_1$

Example III: goal-oriented MDP without a proper policy
[Figure: MDP with initial state I, intermediate states s1, s2, s3, dead-ends d1, d2, d3 and goal G; actions a_I, a_1, a_2, a_3, a_G; costs c = 1, c = 10, c = 100]
4 Markovian policies, according to the action chosen in I: $a_1 \to \pi_1$, $a_2 \to \pi_2$, $a_3 \to \pi_3$, $a_I \to \pi_4$.
Assumptions 1 and 2 are both unsatisfied: $\forall i \in \{1, 2, 3, 4\},\ V^{\pi_i}(I) = +\infty$.
$\gamma$-discounted criterion: $\forall\, 0 < \gamma < 1,\ V_\gamma^{\pi_4}(I) > V_\gamma^{\pi_3}(I) > V_\gamma^{\pi_2}(I) > V_\gamma^{\pi_1}(I)$, so $\pi_1$ is the only optimal DSSP policy for all $0 < \gamma < 1$.
S³P criterion: $\pi_2$ is the only optimal S³P policy!
$P^{G,\pi_1}(I) = 0.55$, $P^{G,\pi_2}(I) = 0.95$, $P^{G,\pi_3}(I) = 0.95$, $P^{G,\pi_4}(I) = 0$
$C^{G,\pi_1}(I) = 1.818$, $C^{G,\pi_2}(I) = 10.53$, $C^{G,\pi_3}(I) = 147.4$, $C^{G,\pi_4}(I) = 0$

Lessons learnt from these examples
If you care about: first reaching the goal, then minimizing costs along paths to the goal, as in deterministic/classical planning!
Then be aware that: previously existing criteria do not guarantee to produce such policies.
Not to mention that: standard criteria need not even be well-defined for many interesting goal-oriented MDPs, contrary to S³Ps.
So if the notion of goal and well-foundedness guarantees matter to you, you may be interested in S³Ps.

Policy evaluation for the S³P dual criterion
Theorem 1 (finite horizon $H \in \mathbb{N}$, general costs). For all steps-to-go $1 \leq n \leq H$, all history-dependent policies $\pi = (\pi_0, \dots, \pi_{H-1})$ and all states $s \in S$:
$P_n^{G,\pi}(s) = \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s')$, with $P_0^{G,\pi}(s) = 0$ for $s \in S \setminus G$ and $P_0^{G,\pi}(g) = 1$ for $g \in G$.   (1)
If $P_n^{G,\pi}(s) > 0$, then $C_n^{G,\pi}(s)$ is well-defined and satisfies:
$C_n^{G,\pi}(s) = \dfrac{1}{P_n^{G,\pi}(s)} \sum_{s' \in S} T(s, \pi_{H-n}(s), s')\, P_{n-1}^{G,\pi}(s') \left[ c(s, \pi_{H-n}(s), s') + C_{n-1}^{G,\pi}(s') \right]$, with $C_0^{G,\pi}(s) = 0$ for all $s \in S$.   (2)
The factor $1 / P_n^{G,\pi}(s)$ is a normalization factor for averaging only over paths to the goal.
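As a concrete reading of Theorem 1, the following Python sketch evaluates the goal-probability and goal-cost recursions (1) and (2) for a fixed stationary policy over a finite horizon. The tabular MDP encoding and every identifier are illustrative assumptions, not the author's code.

```python
# Sketch of the finite-horizon goal-probability P_n and goal-cost C_n recursions
# (equations (1) and (2)) for a fixed stationary policy pi.
# The tabular MDP encoding and all identifiers are illustrative assumptions.

def evaluate_goal_prob_cost(states, goals, T, c, pi, horizon):
    P = {s: (1.0 if s in goals else 0.0) for s in states}  # P_0
    C = {s: 0.0 for s in states}                            # C_0
    for _ in range(horizon):
        new_P, new_C = {}, {}
        for s in states:
            if s in goals:
                new_P[s], new_C[s] = 1.0, 0.0  # absorbing, zero-cost goal states
                continue
            a = pi[s]
            # Equation (1): P_n(s) = sum_s' T(s, a, s') * P_{n-1}(s')
            p = sum(prob * P[s2] for s2, prob in T[(s, a)].items())
            # Equation (2): costs averaged only over paths that reach the goal
            if p > 0.0:
                new_C[s] = (1.0 / p) * sum(
                    prob * P[s2] * (c[(s, a, s2)] + C[s2])
                    for s2, prob in T[(s, a)].items()
                )
            else:
                new_C[s] = 0.0  # convention when the goal is unreachable in n steps
            new_P[s] = p
        P, C = new_P, new_C
    return P, C
```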

Convergence of the S³P dual criterion in infinite horizon
A fundamental difference with SSPs: demonstrate and use a contraction property of the transition function over the states reaching the goal with some positive probability (for any Markovian policy, in infinite horizon, and with general costs).
Lemma 1 (infinite horizon, general costs) — generalizes/extends Bertsekas & Tsitsiklis' SSP theoretical results to S³Ps.
Let M be a general goal-oriented MDP, $\pi$ a stationary policy, $T^{\pi}$ the transition matrix for policy $\pi$, and for all $n \in \mathbb{N}$, $X_n^{\pi} = \{ s \in S \setminus G : P_n^{G,\pi}(s) > 0 \}$. Then:
(i) for all $s \in S$, $P_n^{G,\pi}(s)$ converges to a finite value as $n$ tends to $+\infty$;
(ii) there exists $X^{\pi} \subseteq S$ such that $X_n^{\pi} \subseteq X^{\pi}$ for all $n \in \mathbb{N}$, and $T^{\pi}$ is a contraction over $X^{\pi}$.
This new contraction property guarantees the well-foundedness of the S³P criterion and the existence of optimal Markovian policies, without any assumption at all!

Evaluating and optimizing S³P policies in infinite horizon
The S³P dual criterion is well-defined for any goal-oriented MDP and any Markovian policy.
Theorem 2 (infinite horizon, general costs). Let M be any general goal-oriented MDP and $\pi$ any stationary policy for M. The evaluation equations of Theorem 1 converge to finite values $P^{G,\pi}(s)$ and $C^{G,\pi}(s)$ for any $s \in S$ (by convention, $C_n^{G,\pi}(s) = 0$ if $P_n^{G,\pi}(s) = 0$, for all $n \in \mathbb{N}$).
There exists a Markovian policy solving the S³P problem for any goal-oriented MDP.
Proposition (infinite horizon, general costs). Let M be any general goal-oriented MDP.
1. There exists an optimal stationary policy $\pi^*$ that minimizes the infinite-horizon goal-cost function among all policies that maximize the infinite-horizon goal-probability function, i.e. $\pi^*$ is S³P-optimal.
2. The goal-probability $P^{G,\pi^*}$ and goal-cost $C^{G,\pi^*}$ functions have finite values.

But wait... Can we easily optimize S³Ps?
Greedy Markovian policies w.r.t. the goal-probability or goal-cost metrics need not be optimal!
[Figure: the MDP of Example I, with states I, s, G, actions a_I, a_1, a_2, a_G and costs on the transitions]
$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s')$, with $P_0^*(s) = 0$ for $s \in S \setminus G$ and $P_0^*(g) = 1$ for $g \in G$.
After 3 iterations: $a_2 \in \operatorname{argmax}_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_2^*(s')$, where the maximum equals 1, whereas $P^{G,\pi=(a_2, a_2, a_2, \dots)}(s) = 0 < 1 = P^*(s)$!
Solution: implicitly eliminate non-optimal greedy policies w.r.t. the goal-probability criterion by searching for greedy policies w.r.t. the goal-cost criterion...
...provided all costs are positive (except from the goal), so that choosing $a_2$ has an infinite cost (since $P^{G,\pi_{a_2}}(s) = 0 < T(s, a_2, I)\, P^*(I) = 1$).

Optimization of S³P Markovian policies in infinite horizon
Under positive costs, greedy policies w.r.t. the goal-probability and goal-cost functions are S³P-optimal.
Theorem 3 (infinite horizon, positive costs only). Let M be a goal-oriented MDP such that all transitions from non-goal states have strictly positive costs (positive costs are the only requirement).
$P_n^*(s) = \max_{a \in app(s)} \sum_{s' \in S} T(s, a, s')\, P_{n-1}^*(s')$, with $P_0^*(s) = 0$ for $s \in S \setminus G$ and $P_0^*(g) = 1$ for $g \in G$.   (3)
The functions $P_n^*$ converge to a finite-valued function $P^*$.
$C_n^*(s) = 0$ if $P^*(s) = 0$; otherwise, if $P^*(s) > 0$:
$C_n^*(s) = \min_{a \in app(s):\ \sum_{s' \in S} T(s, a, s')\, P^*(s') = P^*(s)}\ \dfrac{1}{P^*(s)} \sum_{s' \in S} T(s, a, s')\, P^*(s') \left[ c(s, a, s') + C_{n-1}^*(s') \right]$, with $C_0^*(s) = 0$ for all $s \in S$.   (4)
The functions $C_n^*$ converge to a finite-valued function $C^*$, and any Markovian policy $\pi^*$ obtained from the previous equations at convergence is S³P-optimal.
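The following Python sketch follows one plausible reading of equations (3) and (4) as a Goal-Probability and -Cost Iteration: first iterate the greedy goal-probability recursion to (approximate) convergence, then iterate the goal-cost recursion restricted to actions that preserve the maximal goal-probability. The encoding and all identifiers are illustrative assumptions; this is not the author's GPCI implementation.

```python
# Sketch of Goal-Probability and -Cost Iteration (GPCI) as suggested by
# equations (3) and (4); assumes strictly positive costs from non-goal states.
# The tabular MDP encoding and all identifiers are illustrative assumptions.

def gpci(states, goals, app, T, c, eps=1e-9, max_iter=100000):
    # Step 1: greedy goal-probability iteration (equation (3)).
    P = {s: (1.0 if s in goals else 0.0) for s in states}
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s in goals:
                continue
            best = max(sum(p * P[s2] for s2, p in T[(s, a)].items()) for a in app[s])
            delta = max(delta, abs(best - P[s]))
            P[s] = best
        if delta < eps:
            break

    # Step 2: goal-cost iteration (equation (4)), restricted to P*-greedy actions.
    C = {s: 0.0 for s in states}
    policy = {}
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            if s in goals or P[s] == 0.0:
                continue  # convention: C*(s) = 0 when the goal is unreachable
            probs = {a: sum(p * P[s2] for s2, p in T[(s, a)].items()) for a in app[s]}
            pmax = max(probs.values())
            best_cost, best_a = float("inf"), None
            for a in app[s]:
                if probs[a] < pmax - 1e-9:
                    continue  # keep only actions preserving the maximal goal-probability
                cost = (1.0 / P[s]) * sum(
                    p * P[s2] * (c[(s, a, s2)] + C[s2]) for s2, p in T[(s, a)].items()
                )
                if cost < best_cost:
                    best_cost, best_a = cost, a
            delta = max(delta, abs(best_cost - C[s]))
            C[s], policy[s] = best_cost, best_a
        if delta < eps:
            break
    return P, C, policy
```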

Summarizing the new S³P dual optimization criterion (SSP ⊆ GSSP ⊆ S³P)
- Dual optimization: total-cost minimization, averaged only over paths to the goal, among all Markovian policies maximizing the probability of reaching the goal.
- Well-defined in finite or infinite horizon for any goal-oriented MDP (contrary to SSPs or GSSPs).
- But (at the moment): optimization equations in the form of dynamic programming only if all costs from non-goal states are positive → GPCI algorithm (Goal-Probability and -Cost Iteration).
- isspude sub-model (≙ S³P with positive costs): efficient heuristic algorithms by Kolobov, Mausam & Weld (UAI 2012).

Experimental setup
Tested problems:
- Without dead-ends (assumptions 1 and 2 of SSPs satisfied): blocksworld, rectangle-tireworld. Optimization with the standard SSP criterion, then comparison with the S³P dual criterion.
- With dead-ends (assumption 1 or 2 of SSPs unsatisfied): exploding-blocksworld, triangle-tireworld, grid (gridworld variation on Example III). Optimization with the DSSP criterion for many values of $\gamma$ until $P^{G,\pi}$ is maximized (resp. $C^{G,\pi}$ minimized) at best; $\gamma_{opt}$ unknown in advance!
Tested algorithms: VI and LRTDP (optimal for (D)SSPs), RFF, and GPCI (optimal for S³Ps).
Once optimized, all policies are evaluated using the S³P criterion, with a systematic comparison against an optimal policy for S³Ps.

Analysis of the goal-probability function $P^{G,\pi}$
[Plot: goal probability per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF, with $\gamma \approx 0.99$ for the discounted solvers; GPCI is $P^G$-optimal]
DSSP-optimal policies (VI, LRTDP) do not maximize the probability of reaching the goal, whatever the value of $\gamma$!

Analysis of the goal-cost function $C^{G,\pi}$
[Plot: goal cost per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF, with $\gamma \approx 0.99$ for the discounted solvers; GPCI is $(C^G, P^G)$-optimal, i.e. S³P-optimal]
DSSP-optimal policies (VI, LRTDP) do not minimize the total cost averaged only over paths to the goal, whatever $\gamma$! VI and LRTDP achieve smaller goal-costs on some domains, but with actually smaller goal-probabilities as well!

Comparison of computation times
[Plot: computation time in seconds (log scale, 0.01 to 100) per domain (BW, RTW, TTW, EBW, G1, G2) for GPCI, VI, LRTDP and RFF]
GPCI is as efficient as VI on problems that are not really probabilistically interesting ($P^{G,*} \approx 1$), and faster than VI, LRTDP and even RFF on problems with dead-ends and a complex cost structure.

Conclusion and perspectives
An original, well-founded dual criterion for goal-oriented MDPs (SSP ⊆ GSSP ⊆ S³P):
- S³P dual criterion: minimum goal-cost policy among the ones with maximum goal-probability.
- Well-defined in infinite horizon for any goal-oriented MDP (no assumptions required, contrary to SSPs or GSSPs).
- If costs are positive: GPCI algorithm, or heuristic algorithms for the isspude sub-model [Kolobov, Mausam & Weld, 2012].
Future work:
- Unifying our general-cost model and the positive-cost model of Kolobov, Mausam & Weld.
- Algorithms for solving S³Ps with general costs.
- Domain-independent heuristics for estimating the goal-probability and goal-cost functions.

Thank you for your attention!
To sum up: if you care about first reaching the goal and then minimizing costs along paths to the goal, as in deterministic/classical planning, be aware that previously existing criteria do not guarantee to produce such policies and need not even be well-defined for many interesting goal-oriented MDPs, contrary to S³Ps. If the notion of goal and well-foundedness guarantees matter to you, you may be interested in S³Ps.