Markov Decision Processes Chapter 17. Mausam


Planning Agent. The agent receives percepts from the environment, asks "What action next?", and executes actions. Environment dimensions: static vs. dynamic; fully vs. partially observable; deterministic vs. stochastic; perfect vs. noisy; instantaneous vs. durative.

Classical Planning. Environment: static, fully observable, perfect, deterministic, instantaneous. The agent receives percepts, decides "What action next?", and acts.

Stochastic Planning: MDPs. Environment: static, fully observable, perfect, stochastic, instantaneous. The agent receives percepts, decides "What action next?", and acts.

MDP vs. Decision Theory. Decision theory: episodic decisions. MDP: sequential decisions.

Markov Decision Process (MDP). S: a set of states (possibly factored, giving a factored MDP); A: a set of actions; T(s, a, s'): transition model; C(s, a, s'): cost model; G: a set of goals (absorbing or non-absorbing); s0: start state; γ: discount factor; R(s, a, s'): reward model.
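
To make the tuple concrete, here is a minimal sketch (not from the slides) of how the cost-based variant might be held in code; the dataclass fields, the dictionary encodings, and the toy instance are illustrative assumptions, and C is simplified to depend only on (s, a) rather than the slides' C(s, a, s').

```python
from dataclasses import dataclass

# Minimal container for the goal-directed (SSP-style) tuple <S, A, T, C, G, s0>.
# T and C are keyed by (state, action); T[(s, a)] maps each successor s' to
# its probability P(s' | s, a).
@dataclass
class SSP:
    states: set
    actions: dict        # state -> list of applicable actions
    T: dict              # (s, a) -> {s': probability}
    C: dict              # (s, a) -> cost (simplification of C(s, a, s'))
    goals: set
    s0: str

# A tiny illustrative instance (hypothetical states and actions).
toy = SSP(
    states={"s0", "s1", "g"},
    actions={"s0": ["a"], "s1": ["b"], "g": []},
    T={("s0", "a"): {"s1": 0.6, "g": 0.4}, ("s1", "b"): {"g": 1.0}},
    C={("s0", "a"): 1.0, ("s1", "b"): 1.0},
    goals={"g"},
    s0="s0",
)
print(toy.s0, toy.goals)
```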

Objective of an MDP. Find a policy π: S → A which optimizes (minimizes expected cost to reach a goal, maximizes expected reward, or maximizes expected discounted or undiscounted reward minus cost) over a given horizon (finite, infinite, or indefinite), assuming full observability.

Role of Discount Factor (γ). Keeps the total reward / total cost finite; useful for infinite-horizon problems. Intuition (economics): money today is worth more than money tomorrow. Total reward: r1 + γ·r2 + γ²·r3 + … Total cost: c1 + γ·c2 + γ²·c3 + …
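
To see why γ < 1 keeps the total finite, a one-line bound (a standard geometric-series argument, assuming per-step rewards are bounded, |r_t| ≤ R_max, which is an assumption added here):

```latex
\Bigl|\sum_{t=1}^{\infty} \gamma^{t-1} r_t\Bigr|
\;\le\; \sum_{t=1}^{\infty} \gamma^{t-1} |r_t|
\;\le\; R_{\max} \sum_{t=1}^{\infty} \gamma^{t-1}
\;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty .
```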

Examples of MDPs. Goal-directed, indefinite-horizon, cost-minimization MDP <S, A, T, C, G, s0>: most often studied in the planning and graph-theory communities. Infinite-horizon, discounted reward-maximization MDP <S, A, T, R, s0, γ>: most often studied in the machine learning, economics, and operations research communities; the most popular model. Oversubscription planning (non-absorbing goals, reward maximization) MDP <S, A, T, G, R, s0>: a relatively recent model.

Acyclic vs. Cyclic MDPs. [Figures: an acyclic MDP and a cyclic MDP with actions a, b, c over a start state P, intermediate states Q, R, S, T, and goal G; C(a) = 5, C(b) = 10, C(c) = 1.] Acyclic case: expectimin works; V(Q) = V(R) = V(S) = V(T) = 1, V(P) = 6, best action a. Cyclic case: expectimin doesn't work (infinite loop); V(R) = V(S) = V(T) = 1 and Q(P, b) = 11, but Q(P, a) = ? Suppose we decide to take a in P: Q(P, a) = 5 + 0.4·1 + 0.6·Q(P, a), so 0.4·Q(P, a) = 5.4 and Q(P, a) = 13.5 (making b, with Q(P, b) = 11, the cheaper choice at P).

Policy Evaluation. Given a policy π, compute V^π: the cost of reaching the goal while following π.

Deterministic MDPs. Policy graph for π: π(s0) = a0, π(s1) = a1, with C = 5 on the edge s0 → s1 and C = 1 on the edge s1 → sg. V^π(s1) = 1, V^π(s0) = 6: simply add up the costs on the path to the goal.

Acyclic MDPs. Policy graph for π: from s0, action a0 reaches s1 with Pr = 0.6 (C = 5) and s2 with Pr = 0.4 (C = 2); a1 from s1 costs 1 and a2 from s2 costs 4, both reaching sg. V^π(s1) = 1, V^π(s2) = 4, V^π(s0) = 0.6·(5 + 1) + 0.4·(2 + 4) = 6. Evaluate with a backward pass in reverse topological order.

General MDPs can be cyclic! Same graph as before, except a2 from s2 now reaches sg with Pr = 0.7 (C = 4) and loops back to s0 with Pr = 0.3 (C = 3). A simple single pass no longer works: V^π(s1) = 1, but V^π(s2) depends on V^π(s0), and V^π(s0) depends on V^π(s2).

General SSPs can be cyclic! The values satisfy a simple system of linear equations: V^π(sg) = 0; V^π(s1) = 1 + V^π(sg) = 1; V^π(s2) = 0.7·(4 + V^π(sg)) + 0.3·(3 + V^π(s0)); V^π(s0) = 0.6·(5 + V^π(s1)) + 0.4·(2 + V^π(s2)).

Policy Evaluation (Approach 1): solve the system of linear equations V^π(s) = 0 if s ∈ G, and otherwise V^π(s) = Σ_{s'∈S} T(s, π(s), s')·[C(s, π(s), s') + V^π(s')]. This has |S| variables and O(|S|³) running time.
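
A minimal sketch of Approach 1 on the cyclic SSP example above, assuming numpy is available; the matrix is just the three equations rearranged into (I − T_π)V = c_π form, with V(sg) = 0 substituted in.

```python
import numpy as np

# Policy evaluation by solving the linear system, for the cyclic SSP example:
#   V(s0) = 0.6*(5 + V(s1)) + 0.4*(2 + V(s2))
#   V(s1) = 1.0*(1 + V(sg)) = 1
#   V(s2) = 0.7*(4 + V(sg)) + 0.3*(3 + V(s0))
# with V(sg) = 0 by definition.
A = np.array([
    [ 1.0, -0.6, -0.4],   # V(s0) - 0.6 V(s1) - 0.4 V(s2) = 0.6*5 + 0.4*2
    [ 0.0,  1.0,  0.0],   # V(s1)                          = 1
    [-0.3,  0.0,  1.0],   # V(s2) - 0.3 V(s0)              = 0.7*4 + 0.3*3
])
b = np.array([0.6 * 5 + 0.4 * 2, 1.0, 0.7 * 4 + 0.3 * 3])

V = np.linalg.solve(A, b)          # O(|S|^3) in general
print({s: round(float(v), 4) for s, v in zip(["s0", "s1", "s2"], V)})
# -> roughly {'s0': 6.6818, 's1': 1.0, 's2': 5.7045}, the values the
#    iterative-evaluation slide converges toward.
```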

Iterative Policy Evaluation. [Figure: the cyclic SSP above.] With V^π(s1) = 1, the remaining equations are V^π(s0) = 4.4 + 0.4·V^π(s2) and V^π(s2) = 3.7 + 0.3·V^π(s0). Starting from 0, V^π(s0) iterates through 5.88, 6.5856, 6.670272, 6.68043, …, and V^π(s2) through 3.7, 5.464, 5.67568, 5.7010816, 5.704129, …

Policy Evaluation (Approach 2): V^π(s) = Σ_{s'∈S} T(s, π(s), s')·[C(s, π(s), s') + V^π(s')]. Iterative refinement: V^π_n(s) ← Σ_{s'∈S} T(s, π(s), s')·[C(s, π(s), s') + V^π_{n-1}(s')].

Iterative Policy Evaluation. [Algorithm box]: repeat the refinement for n = 1, 2, … and terminate with an ε-consistency condition.
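
A sketch of the iterative refinement on the same example, with ε-consistency as the stopping rule; the dictionary encoding and the ε value are illustrative choices, and the in-place updates converge to the same fixed point as the linear-system solve.

```python
# Iterative policy evaluation (Approach 2) on the cyclic SSP example,
# stopping when the largest change in a sweep falls below epsilon.
# T_pi[s] = list of (probability, cost, successor) under the fixed policy pi.
T_pi = {
    "s0": [(0.6, 5.0, "s1"), (0.4, 2.0, "s2")],
    "s1": [(1.0, 1.0, "sg")],
    "s2": [(0.7, 4.0, "sg"), (0.3, 3.0, "s0")],
}
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "sg": 0.0}   # sg is the goal: V = 0
epsilon = 1e-6

while True:
    delta = 0.0
    for s, outcomes in T_pi.items():
        new_v = sum(p * (c + V[sp]) for p, c, sp in outcomes)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < epsilon:
        break

print({s: round(v, 4) for s, v in V.items() if s != "sg"})
# -> approximately {'s0': 6.6818, 's1': 1.0, 's2': 5.7045}
```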

Policy Evaluation → Value Iteration (Bellman Equations for MDP_1 <S, A, T, C, G, s0>). Define V*(s), the optimal cost, as the minimum expected cost to reach a goal from state s. V* satisfies: V*(s) = 0 if s ∈ G, and otherwise V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s')·[C(s, a, s') + V*(s')]. The inner expression is Q*(s, a), so V*(s) = min_a Q*(s, a).

Bellman Equations for MDP_2 <S, A, T, R, s0, γ>. Define V*(s), the optimal value, as the maximum expected discounted reward obtainable from state s. V* satisfies: V*(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s')·[R(s, a, s') + γ·V*(s')].

Fixed Point Computation in VI: V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s')·[C(s, a, s') + V*(s')]. Iterative refinement: V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s, a, s')·[C(s, a, s') + V_{n-1}(s')]. Unlike policy evaluation, this update is non-linear (because of the min).

Example. [Figure: a five-state SSP with states s0 … s4 and goal sg; actions a00, a01, a1, a20, a21, a3, a40, a41; C = 2 and C = 5 on two of the edges; one stochastic action with Pr = 0.6 / Pr = 0.4.]

Bellman Backup. [Figure: state s4 with actions a40 (C = 5, to sg) and a41 (C = 2, then Pr = 0.6 to sg, Pr = 0.4 to s3); current values V0(sg) = 0, V0(s3) = 2.] Q1(s4, a40) = 5 + 0 = 5; Q1(s4, a41) = 2 + 0.6·0 + 0.4·2 = 2.8; V1(s4) = min = 2.8, and the greedy action is a41.
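
The same backup written out as a few lines of code (a sketch; the outcome lists are read off the figure):

```python
# One Bellman backup at state s4 of the running example, reproducing the
# numbers on the slide. V0 is the current value function: V0(sg)=0, V0(s3)=2.
V0 = {"sg": 0.0, "s3": 2.0}

# action a40: cost 5, goes to sg deterministically
Q_a40 = 5.0 + 1.0 * V0["sg"]                       # = 5.0
# action a41: cost 2, then 0.6 -> sg, 0.4 -> s3
Q_a41 = 2.0 + 0.6 * V0["sg"] + 0.4 * V0["s3"]      # = 2.8

V1_s4 = min(Q_a40, Q_a41)                          # Bellman backup: 2.8
greedy_action = "a41" if Q_a41 <= Q_a40 else "a40"
print(V1_s4, greedy_action)                        # 2.8 a41
```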

Value Iteration [Bellman 57]. No restriction on the initial value function. [Algorithm box]: iterate the Bellman backups for n = 1, 2, … and terminate with an ε-consistency condition.
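
A compact value-iteration sketch with the ε-consistency stopping rule. The tiny two-state SSP used here is an assumed example (not the slides' five-state one), chosen so the answer is easy to check by hand: V*(s1) = 1 and V*(s0) = 2 via the "risky" action.

```python
# Value iteration for a cost-based SSP, terminating when the largest Bellman
# residual over all states drops below eps (epsilon-consistency).
# Assumed toy MDP: at s0, "risky" (cost 1) reaches the goal with prob 0.5 and
# self-loops otherwise; "safe" (cost 2) goes to s1, from which "go" (cost 1)
# reaches the goal.
MDP = {
    "s0": {"risky": [(0.5, 1.0, "g"), (0.5, 1.0, "s0")],
           "safe":  [(1.0, 2.0, "s1")]},
    "s1": {"go":    [(1.0, 1.0, "g")]},
}
GOALS = {"g"}

def value_iteration(mdp, goals, eps=1e-6):
    V = {s: 0.0 for s in mdp}
    V.update({g: 0.0 for g in goals})
    while True:
        residual = 0.0
        for s, actions in mdp.items():
            q = {a: sum(p * (c + V[sp]) for p, c, sp in outs)
                 for a, outs in actions.items()}
            best = min(q.values())
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual < eps:
            return V

print(value_iteration(MDP, GOALS))   # V(s0) -> 2.0, V(s1) -> 1.0 (approximately)
```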

Example (all actions cost 1 unless otherwise stated). [Figure: the five-state SSP above.] Value iteration trace:

 n    Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
 0    3         3         2         2         1
 1    3         3         2         2         2.8
 2    3         3         3.8       3.8       2.8
 3    4         4.8       3.8       3.8       3.52
 4    4.8       4.8       4.52      4.52      3.52
 5    5.52      5.52      4.52      4.52      3.808
 20   5.99921   5.99921   4.99969   4.99969   3.99969

Comments. Value iteration is a decision-theoretic algorithm, a form of dynamic programming, and a fixed-point computation; it is a probabilistic version of the Bellman-Ford algorithm for shortest-path computation (MDP_1 is the Stochastic Shortest Path problem). Time complexity: O(|S|²·|A|) per iteration, with poly(|S|, |A|, 1/(1−γ)) iterations. Space complexity: O(|S|).

Monotonicity. For all n > k: if V_k ≤ V* (componentwise) then V_n ≤ V* (V_n is monotonic from below); if V_k ≥ V* then V_n ≥ V* (V_n is monotonic from above).

Changing the Search Space. Value iteration: search in value space, then compute the resulting policy. Policy iteration: search in policy space, then compute the resulting value.

Policy Iteration [Howard 60]. Assign an arbitrary policy π0 to each state, then repeat: Policy Evaluation: compute V_{n+1}, the evaluation of π_n (costly: O(n³)); Policy Improvement: for all states s compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s, a) (argmin for the cost formulation); until π_{n+1} = π_n. Advantage: searching in a finite (policy) space rather than an uncountably infinite (value) space, so convergence is faster; all other properties follow. Modified Policy Iteration: approximate the evaluation step by value iteration under the fixed policy.
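
A sketch of policy iteration on the same assumed toy SSP (cost formulation, so the improvement step takes an argmin). The exact evaluation solves the linear system; swapping it for a few sweeps of iterative evaluation would give Modified Policy Iteration. The initial policy is assumed proper (it reaches the goal with probability 1), which this toy example guarantees.

```python
import numpy as np

MDP = {
    "s0": {"risky": [(0.5, 1.0, "g"), (0.5, 1.0, "s0")],
           "safe":  [(1.0, 2.0, "s1")]},
    "s1": {"go":    [(1.0, 1.0, "g")]},
}
GOALS = {"g"}
STATES = list(MDP)                                   # non-goal states, fixed order

def evaluate(pi):
    """Exact policy evaluation: solve (I - T_pi) V = c_pi; goals have V = 0."""
    n = len(STATES)
    idx = {s: i for i, s in enumerate(STATES)}
    A, b = np.eye(n), np.zeros(n)
    for s in STATES:
        for p, c, sp in MDP[s][pi[s]]:
            b[idx[s]] += p * c
            if sp not in GOALS:
                A[idx[s], idx[sp]] -= p
    V = dict(zip(STATES, np.linalg.solve(A, b)))
    V.update({g: 0.0 for g in GOALS})
    return V

def policy_iteration():
    pi = {s: next(iter(MDP[s])) for s in STATES}     # arbitrary (proper) policy
    while True:
        V = evaluate(pi)                             # policy evaluation
        new_pi = {s: min(MDP[s],                     # greedy policy improvement
                         key=lambda a: sum(p * (c + V[sp])
                                           for p, c, sp in MDP[s][a]))
                  for s in STATES}
        if new_pi == pi:
            return pi, V
        pi = new_pi

print(policy_iteration())   # expect pi(s0) = 'risky', V(s0) = 2.0, V(s1) = 1.0
```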

Modified Policy Iteration. Assign an arbitrary policy π0 to each state, then repeat: Policy Evaluation: compute V_{n+1}, an approximate evaluation of π_n (a few value-iteration sweeps under the fixed policy); Policy Improvement: for all states s compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s, a); until π_{n+1} = π_n. Advantage: probably the most competitive synchronous dynamic programming algorithm.

Applications: stochastic games; robotics (navigation, helicopter maneuvers); finance (options, investments); communication networks; medicine (radiation planning for cancer); controlling workflows; optimizing bidding decisions in auctions; traffic flow optimization; aircraft queueing for landing; airline meal provisioning; optimizing software on mobiles; forest firefighting.

VI → Asynchronous VI. Is backing up all states in every iteration essential? No! States may be backed up any number of times, in any order. As long as no state gets starved, the convergence properties still hold.

Residual with respect to a Value Function V (Res_V). Residual at s with respect to V: the magnitude of the change in V(s) after one Bellman backup at s, Res_V(s) = |V(s) − min_{a∈A} Σ_{s'∈S} T(s, a, s')·[C(s, a, s') + V(s')]|. Residual with respect to V: the max residual, Res_V = max_s Res_V(s). Res_V < ε is ε-consistency.

(General) Asynchronous VI. [Algorithm box.]
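
A sketch of asynchronous VI on the same assumed toy SSP: states are backed up one at a time in a random order until ε-consistency, and the residual function below is exactly the Res_V(s) defined above.

```python
import random

MDP = {
    "s0": {"risky": [(0.5, 1.0, "g"), (0.5, 1.0, "s0")],
           "safe":  [(1.0, 2.0, "s1")]},
    "s1": {"go":    [(1.0, 1.0, "g")]},
}
V = {"s0": 0.0, "s1": 0.0, "g": 0.0}

def q(s, a):
    return sum(p * (c + V[sp]) for p, c, sp in MDP[s][a])

def residual(s):
    """Magnitude of the change a Bellman backup at s would cause: Res_V(s)."""
    return abs(V[s] - min(q(s, a) for a in MDP[s]))

eps = 1e-6
while max(residual(s) for s in MDP) >= eps:      # epsilon-consistency check
    s = random.choice(list(MDP))                 # any order, any number of times
    V[s] = min(q(s, a) for a in MDP[s])          # Bellman backup at s

print({s: round(v, 4) for s, v in V.items()})    # -> s0: 2.0, s1: 1.0, g: 0.0
```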

Prioritization of Bellman Backups. Are all backups equally important? Can we avoid some backups? Can we schedule the backups more appropriately?

Useless Backups? [Same five-state example and value-iteration table as above; many of the tabled values do not change from one iteration to the next, so the corresponding backups did no useful work.]

Asynchronous VI → Prioritized VI.

Which state to prioritize? [Figure: three candidate states. All of s1's successors have V = 0; s2 reaches a successor with V = 2 with probability 0.8; s3 reaches a successor with V = 5 with probability 0.1; all other successors have V = 0.] s1 has zero priority, s2 has higher priority, s3 has low priority.

Prioritized Sweeping. After backing up a state s', raise the priority of each predecessor s: priority_PS(s) ← max{ priority_PS(s), max_{a∈A} T(s, a, s')·Res_V(s') }. Convergence [Li & Littman 08]: prioritized sweeping converges to the optimal value function in the limit if all initial priorities are non-zero (it does not need synchronous VI iterations).
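
A prioritized-sweeping sketch on the same assumed toy SSP: a lazily updated heap stands in for the priority function, priorities are bumped by T(s, a, s')·Res_V(s') as on the slide, and the ε cutoff is a pragmatic stopping rule for this small example rather than part of the convergence result.

```python
import heapq

MDP = {
    "s0": {"risky": [(0.5, 1.0, "g"), (0.5, 1.0, "s0")],
           "safe":  [(1.0, 2.0, "s1")]},
    "s1": {"go":    [(1.0, 1.0, "g")]},
}
V = {"s0": 0.0, "s1": 0.0, "g": 0.0}

def backup(s):
    return min(sum(p * (c + V[sp]) for p, c, sp in outs)
               for outs in MDP[s].values())

# predecessors[s'] = set of (s, a, p) with T(s, a, s') = p > 0
preds = {}
for s, acts in MDP.items():
    for a, outs in acts.items():
        for p, _, sp in outs:
            preds.setdefault(sp, set()).add((s, a, p))

eps = 1e-6
heap = [(-1.0, s) for s in MDP]        # non-zero initial priorities, as required
heapq.heapify(heap)                    # by the Li & Littman convergence result
while heap:
    neg_prio, s = heapq.heappop(heap)  # state with the largest priority
    if -neg_prio < eps:
        break
    res = abs(V[s] - backup(s))        # Res_V(s) before the backup
    V[s] = backup(s)                   # Bellman backup at s
    for pred, _, p in preds.get(s, ()):     # bump predecessors' priorities
        if pred in MDP and p * res >= eps:
            heapq.heappush(heap, (-p * res, pred))

print({s: round(v, 4) for s, v in V.items()})   # close to s0: 2.0, s1: 1.0, g: 0.0
```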

Prioritized Sweeping on the five-state example:

             V(s0)   V(s1)   V(s2)   V(s3)   V(s4)
 Initial V   3       3       2       2       1
 Update      3       3       2       2       2.8
 Priority    0       0       1.8     1.8     0
 Update      3       3       3.8     3.8     2.8
 Priority    2       2       0       0       1.2
 Update      3       4.8     3.8     3.8     2.8

Limitations of VI / Extensions. Scalability: memory is linear in the size of the state space, and time is at least polynomial. Polynomial is good, no? Not really: state spaces are usually huge; with n state variables there are 2^n states. The curse of dimensionality!

Heuristic Search. Insight 1: use knowledge of a start state to save on computation (analogous to going from all-sources shortest paths to single-source shortest path). Insight 2: use additional knowledge in the form of a heuristic function (analogous to going from DFS/BFS to A*).

Model: an MDP with an additional start state s0, denoted MDP_s0. What is the solution to an MDP_s0? A policy (S → A)? Are states that are not reachable from s0 relevant? What about states that are never visited (even though reachable)?

Partial Policy. Define a partial policy as π: S' → A, where S' ⊆ S. Define a partial policy closed w.r.t. a state s as a partial policy π_s that is defined for all states s' reachable by π_s starting from s.
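
A small sketch of checking the "closed w.r.t. s" condition by reachability; the (prob, cost, successor) encoding and the tiny example are assumptions, not the figure that follows.

```python
def is_closed(partial_pi, T, goals, s):
    """True iff partial_pi assigns an action to every non-goal state
    reachable from s when acting according to partial_pi."""
    stack, seen = [s], set()
    while stack:
        u = stack.pop()
        if u in seen or u in goals:
            continue
        seen.add(u)
        if u not in partial_pi:
            return False                    # reachable but no action assigned
        for _, _, v in T[(u, partial_pi[u])]:
            stack.append(v)
    return True

# Tiny illustrative use (hypothetical states/actions):
T = {("s0", "a1"): [(0.5, 1.0, "s1"), (0.5, 1.0, "g")],
     ("s1", "a2"): [(1.0, 1.0, "g")]}
print(is_closed({"s0": "a1"}, T, {"g"}, "s0"))               # False: s1 unassigned
print(is_closed({"s0": "a1", "s1": "a2"}, T, {"g"}, "s0"))   # True
```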

Partial policy closed w.r.t. s0. [Figures: a graph over states s0 … s9 and goal sg.] Is the partial policy π_s0(s0) = a1, π_s0(s1) = a2, π_s0(s2) = a1 closed w.r.t. s0? The example is then extended with π_s0(s6) = a1, and the resulting policy graph of π_s0 is shown.

Greedy Policy Graph. Define the greedy policy w.r.t. V: π^V(s) = argmin_a Q^V(s, a). Define the greedy partial policy rooted at s0: the partial policy rooted at s0 given by the greedy policy, denoted π^V_s0. Define the greedy policy graph: the policy graph of π^V_s0, denoted G^V_s0.

Heuristic Function. h: S → R estimates V*(s); it gives an indication of the goodness of a state; it is usually used for initialization, V_0(s) = h(s); it helps us avoid seemingly bad states. Define an admissible (optimistic) heuristic as one with h(s) ≤ V*(s).

A General Scheme for Heuristic Search in MDPs. Two (over)simplified intuitions: focus on states in the greedy policy w.r.t. V rooted at s0, and focus on states with residual > ε. Find & Revise: repeat (find a state satisfying the two properties above; revise it with a Bellman backup) until no such state remains.
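
A sketch of the generic Find-&-Revise loop on the assumed toy SSP, with h = 0 (trivially admissible for costs): each pass finds a greedy-reachable state whose residual exceeds ε and revises it with a Bellman backup. Note that s1 is never revised because it never becomes greedy-reachable, mirroring the "never expanded / never touched" observation in the LAO* example below.

```python
MDP = {
    "s0": {"risky": [(0.5, 1.0, "g"), (0.5, 1.0, "s0")],
           "safe":  [(1.0, 2.0, "s1")]},
    "s1": {"go":    [(1.0, 1.0, "g")]},
}
GOALS, S0 = {"g"}, "s0"
h = lambda s: 0.0                       # admissible heuristic (0 <= V*)
V = {s: h(s) for s in MDP}
V.update({g: 0.0 for g in GOALS})

def q(s, a):
    return sum(p * (c + V[sp]) for p, c, sp in MDP[s][a])

def greedy_reachable():
    """States reachable from s0 when following the greedy policy w.r.t. V."""
    stack, seen = [S0], set()
    while stack:
        s = stack.pop()
        if s in seen or s in GOALS:
            continue
        seen.add(s)
        best_a = min(MDP[s], key=lambda a: q(s, a))
        stack.extend(sp for _, _, sp in MDP[s][best_a])
    return seen

eps = 1e-6
while True:
    # FIND a greedy-reachable state whose residual exceeds eps ...
    flawed = [s for s in greedy_reachable()
              if abs(V[s] - min(q(s, a) for a in MDP[s])) > eps]
    if not flawed:
        break
    # ... and REVISE it with a Bellman backup.
    s = flawed[0]
    V[s] = min(q(s, a) for a in MDP[s])

print({s: round(v, 4) for s, v in V.items()})
# -> s0 converges to 2.0; s1 is never revised (never greedy-reachable),
#    so it keeps its heuristic value.
```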

A* → LAO*. A*: regular graph; solution: a (shortest) path. AO* [Nilsson 71]: acyclic AND/OR graph; solution: an (expected shortest) acyclic graph. LAO* [Hansen & Zilberstein 98]: cyclic AND/OR graph; solution: an (expected shortest) cyclic graph. All three algorithms make effective use of reachability information.

LAO* Family. Add s0 to the fringe and to the greedy policy graph; repeat: FIND: expand some states on the fringe (in the greedy graph), initialize all new states by their heuristic values, and choose a subset of affected states; REVISE: perform some Bellman backups on this subset and recompute the greedy graph; until the greedy graph has no fringe and the residuals in the greedy graph are small. Output the greedy graph as the final policy.

LAO*. Add s0 to the fringe and to the greedy policy graph; repeat: FIND: expand the best state s on the fringe (in the greedy graph), initialize all new states by their heuristic values, and let the subset be all states in the expanded graph that can reach s; REVISE: perform VI on this subset and recompute the greedy graph; until the greedy graph has no fringe and the residuals in the greedy graph are small. Output the greedy graph as the final policy.

LAO* walkthrough. [Figures: the graph over s0 … s8 and goal sg evolves step by step.] Start by adding s0 to the fringe and to the greedy graph, with V(s0) = h(s0). Each iteration then expands some fringe states of the greedy graph, initializes all newly added states to their heuristic values, runs VI on the subset of expanded-graph states that can reach the newly expanded state, and recomputes the greedy graph; state labels switch from h to V as they are backed up. When the greedy graph has no fringe and its residuals are small, the greedy graph is output as the final policy. Note that in this example s4 was never expanded and s8 was never touched.

Extensions. Heuristic search + dynamic programming: AO*, LAO*, RTDP. Factored MDPs: add planning-graph-style heuristics; use goal regression to generalize better. Hierarchical MDPs: a hierarchy of sub-tasks and actions, to scale better. Reinforcement learning: learning the probabilities and rewards; acting while learning; connections to psychology. Partially Observable Markov Decision Processes: noisy sensors; a partially observable environment; popular in robotics.