CS 7180: Behavioral Modeling and Decisionmaking


1 CS 7180: Behavioral Modeling and Decisionmaking in AI
Markov Decision Processes for Complex Decisionmaking
Prof. Amy Sliva
October 17, 2012


2 Decisions are nondeterministic
In many situations, behavior and decisions may have more than one possible outcome:
  Action failures, e.g., the gripper drops its load
  Exogenous events, e.g., a road is closed
[Figure: blocks a, b, c; the action grasp(c) shown with its intended outcome and an unintended outcome]
We still need to be able to plan and make decisions in such situations.
One approach: Markov decision processes

3 Decision-making under uncertainty
Decisions are traditionally represented in decision trees.
Example decision problem: Should I have my party inside or outside?
[Figure: decision tree. Inside: dry -> Regret, wet -> Relieved. Outside: dry -> Perfect!, wet -> Disaster]
Limitations of decision trees:
  Exponential in the number of decisions
  What if there are many stages to consider?
  What if there are many different outcomes?
Markov decision processes (MDPs):
  Framework for complex multi-stage decision problems under uncertainty
  More compact representation of the problem, with efficient solutions
  A further specialized influence diagram
  No explicit representation of qualitative dependencies

4 Stochastic systems of actions
We have already learned about stochastic systems: Markov chains / Markov processes.
Stochastic system: a triple $\Sigma = (S, A, P)$
  $S$ = finite set of states
  $A$ = finite set of actions
  $P_a(s' \mid s)$ = probability of going to $s'$ if we execute $a$ in $s$
  $\sum_{s' \in S} P_a(s' \mid s) = 1$
Several different action representations are possible, e.g., Bayes networks (or influence diagrams), probabilistic state-space operators, etc.
  Same underlying semantics
  Fully specified transition model: explicit enumeration of each $P_a(s' \mid s)$

5 MDPs are stochastic systems with utility
Model the dynamics of the environment under different actions.
Formally defined as a 4-tuple MDP = $(S, A, T, R)$
  $S$ = set of states (environmental context)
  $A$ = set of actions (possible behaviors)
  $T$ = transition model, $T(s, a, s') = P_a(s' \mid s)$ (what states can result from an action?)
  $R$ = reward model: reward $R(s)$ for each state and cost $C(s, a)$ for each state and action
Markov assumption: the next state depends only on the current state and action.
Full observability: we cannot predict exactly which state will result from an action, but once it is realized we know what it is.
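As a minimal sketch of the 4-tuple above, the following Python container stores S, A, T, R, and C and checks that each $P_a(\cdot \mid s)$ sums to 1. The class name `MDP`, the method `validate`, and the two-state `toy` instance are illustrative assumptions, not part of the lecture.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list    # S
    actions: dict   # A(s): state -> list of applicable actions
    T: dict         # transition model: (s, a) -> {s_next: P_a(s_next | s)}
    R: dict         # reward model: s -> R(s)
    C: dict         # cost model: (s, a) -> C(s, a)

    def validate(self, tol=1e-9):
        """Check that every P_a(. | s) is a probability distribution over S."""
        for s in self.states:
            for a in self.actions[s]:
                dist = self.T[(s, a)]
                if any(s2 not in self.states or p < 0 for s2, p in dist.items()):
                    raise ValueError(f"bad successor or probability for {(s, a)}")
                if abs(sum(dist.values()) - 1.0) > tol:
                    raise ValueError(f"probabilities for {(s, a)} do not sum to 1")
        return True


# Hypothetical two-state example (not the robot domain from the slides).
toy = MDP(
    states=["s1", "goal"],
    actions={"s1": ["try", "stay"], "goal": ["stay"]},
    T={
        ("s1", "try"): {"goal": 0.5, "s1": 0.5},
        ("s1", "stay"): {"s1": 1.0},
        ("goal", "stay"): {"goal": 1.0},
    },
    R={"s1": 0.0, "goal": 100.0},
    C={("s1", "try"): 1.0, ("s1", "stay"): 0.0, ("goal", "stay"): 0.0},
)
toy.validate()
```

Storing T as explicit per-(state, action) distributions mirrors the "fully specified transition model" mentioned earlier; a factored representation would need a different container.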

6 Graphical MDP representation
Nodes are possible states of the world, e.g., the location of a robot.
Arcs are actions, e.g., move to a new location.
Each arc has an associated transition probability.
Each state has an associated reward or cost.

7 First-order Markov dynamics and rewards
First-order Markov dynamics (history independence):
  $P(s_{t+1} \mid a_t, s_t, a_{t-1}, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid a_t, s_t)$
  The next state depends only on the current state and current action.
First-order Markov reward process:
  $P(r_t \mid a_t, s_t, a_{t-1}, s_{t-1}, \ldots, s_0) = P(r_t \mid a_t, s_t)$
  The reward depends only on the current state and action.
  Assume the reward is specified by a deterministic function $R(s)$.
Stationary dynamics and reward:
  $P(s_{t+1} \mid a_t, s_t) = P(s_{k+1} \mid a_k, s_k)$ for all $t, k$
  The world dynamics do not change over time.

8 Planning actions in MDPs
Robot r1 starts at location l1 (state s1 in the diagram).
The objective is to get r1 to location l4 (state s4 in the diagram).
[Figure: state diagram with Start at s1 and Goal at s4]

9 Probability of initial states
For every state $s$, there is a probability $P(s)$ that the system starts in $s$.
Sometimes we assume a unique state $s_0$ such that the system always starts in $s_0$.
In the example, $s_0 = s1$:
  $P(s1) = 1$
  $P(s) = 0$ for all $s \ne s1$

10 Planning actions in MDPs
Robot r1 starts at location l1 (state s1 in the diagram).
The objective is to get r1 to location l4 (state s4 in the diagram).
No classical plan (sequence of actions) can be a solution. Why not?
  There is no guarantee we'll be in a state where the next action is applicable, e.g.,
  $\pi = \langle$move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4)$\rangle$

12 Policies for choosing actions
A policy is a function mapping states to actions.
Stationary policy $\pi: S \to A$
  $\pi(s)$ is the action to do at state $s$ (regardless of time)
  A set of state-action pairs
  Specifies a continuously reactive controller
Nonstationary policy $\pi: S \times \mathbb{N} \to A$, where $\mathbb{N}$ is the non-negative integers
  $\pi(s, t)$ is the action to do at state $s$ with $t$ stages-to-go
What if we want to keep acting indefinitely?

13 Example stationary policy
Robot r1 starts at location l1 (state s1 in the diagram).
The objective is to get r1 to location l4 (state s4 in the diagram).
$\pi_1$ = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}

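14 Histories of previous states
A history is a sequence of system states: $h = \langle s_0, s_1, s_2, s_3, s_4, \ldots \rangle$
Examples:
  $h_0 = \langle s1, s3, s1, s3, s1, \ldots \rangle$
  $h_1 = \langle s1, s2, s3, s4, s4, \ldots \rangle$
  $h_2 = \langle s1, s2, s5, s5, s5, \ldots \rangle$
  $h_3 = \langle s1, s2, s5, s4, s4, \ldots \rangle$
  $h_4 = \langle s1, s4, s4, s4, s4, \ldots \rangle$
  $h_5 = \langle s1, s1, s4, s4, s4, \ldots \rangle$
  $h_6 = \langle s1, s1, s1, s4, s4, \ldots \rangle$
  $h_7 = \langle s1, s1, s1, s1, s1, \ldots \rangle$
Each policy induces a probability distribution over histories.
If $h = \langle s_0, s_1, \ldots \rangle$, then $P(h \mid \pi) = P(s_0) \prod_{i \ge 0} P_{\pi(s_i)}(s_{i+1} \mid s_i)$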

15 Example history distribution
$\pi_1$ = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
  $h_1 = \langle s1, s2, s3, s4, s4, \ldots \rangle$, $P(h_1 \mid \pi_1) = 0.8$
  $h_2 = \langle s1, s2, s5, s5, \ldots \rangle$, $P(h_2 \mid \pi_1) = 0.2$
  $P(h \mid \pi_1) = 0$ for all other $h$
$\pi_1$ reaches the goal with probability 0.8.

17 Example history distribution
$\pi_2$ = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
  $h_1 = \langle s1, s2, s3, s4, s4, \ldots \rangle$, $P(h_1 \mid \pi_2) = 0.8$
  $h_3 = \langle s1, s2, s5, s4, s4, \ldots \rangle$, $P(h_3 \mid \pi_2) = 0.2$
  $P(h \mid \pi_2) = 0$ for all other $h$
$\pi_2$ reaches the goal with probability 1.0.

19 Example history distribution
$\pi_3$ = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}
  $h_4 = \langle s1, s4, s4, s4, \ldots \rangle$, $P(h_4 \mid \pi_3) = 0.5$
  $h_5 = \langle s1, s1, s4, s4, s4, \ldots \rangle$, $P(h_5 \mid \pi_3) = 0.25$
  $h_6 = \langle s1, s1, s1, s4, s4, \ldots \rangle$, $P(h_6 \mid \pi_3) = 0.125$
  $h_7 = \langle s1, s1, s1, s1, s1, s1, \ldots \rangle$, $P(h_7 \mid \pi_3) = 0$
$\pi_3$ reaches the goal with probability 1.0.
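The history probabilities above follow from the product formula $P(h \mid \pi) = P(s_0) \prod_i P_{\pi(s_i)}(s_{i+1} \mid s_i)$. The sketch below evaluates that product for a finite prefix of a history. The transition table `P` is an assumption reconstructed from the probabilities stated in the examples (0.8 / 0.2 at s2), the action name `wait` for staying in place is also assumed, and `history_probability` is a hypothetical helper name.

```python
def history_probability(history, policy, P, p_start):
    """P(h | pi) for a finite prefix h = [s0, s1, ...]."""
    prob = p_start.get(history[0], 0.0)
    for s, s_next in zip(history, history[1:]):
        a = policy[s]                          # action the policy picks in s
        prob *= P[(s, a)].get(s_next, 0.0)     # P_a(s_next | s)
    return prob


# Assumed transition probabilities, consistent with the pi_1 example.
P = {
    ("s1", "move(r1,l1,l2)"): {"s2": 1.0},
    ("s2", "move(r1,l2,l3)"): {"s3": 0.8, "s5": 0.2},
    ("s3", "move(r1,l3,l4)"): {"s4": 1.0},
    ("s4", "wait"): {"s4": 1.0},
    ("s5", "wait"): {"s5": 1.0},
}
pi1 = {
    "s1": "move(r1,l1,l2)",
    "s2": "move(r1,l2,l3)",
    "s3": "move(r1,l3,l4)",
    "s4": "wait",
    "s5": "wait",
}
p_start = {"s1": 1.0}  # the system always starts in s1

p_h1 = history_probability(["s1", "s2", "s3", "s4", "s4"], pi1, P, p_start)
p_h2 = history_probability(["s1", "s2", "s5", "s5"], pi1, P, p_start)
p_h0 = history_probability(["s1", "s3", "s1"], pi1, P, p_start)
```

Because s4 and s5 are absorbing under this policy, extending either prefix with more repetitions of the final state multiplies by 1 and leaves the probability unchanged.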

21 Determining the value of a policy
How good is a policy $\pi$? How do we measure accumulated reward?
A utility function $V: S \to \mathbb{R}$ associates a value with each state (or with each state and time for non-stationary $\pi$).
$V_\pi(s)$ denotes the value of the policy at state $s$.
  Depends on the immediate reward, but also on what you achieve subsequently by following $\pi$
An optimal policy is no worse than any other policy at any state.
The goal of MDP reasoning is to compute an optimal policy (the method depends on how we define utility).

22 Compute utility using reward function
Numeric cost $C(s, a)$ for each state $s$ and action $a$.
Numeric reward $R(s)$ for each state $s$.
No explicit goals now: desirable states have high rewards.
Example:
  $C(s, \text{wait}) = 0$ at every state except s3
  $C(s, a) = 1$ for each horizontal action
  $C(s, a) = 100$ for each vertical action
  $R$ as shown
[Figure: state diagram with rewards r = 0 at s1, s2, and s3, r = 100 at s4, and r = -100 at s5]
Utility of a history under a policy: if $h = \langle s_0, s_1, \ldots \rangle$, then
  $V(h \mid \pi) = \sum_{i \ge 0} [R(s_i) - C(s_i, \pi(s_i))]$

23 Example utility computation
$\pi_1$ = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
$h_1 = \langle s1, s2, s3, s4, s4, \ldots \rangle$
$h_2 = \langle s1, s2, s5, s5, \ldots \rangle$
$V(h_1 \mid \pi_1) = [R(s1) - C(s1, \pi_1(s1))] + [R(s2) - C(s2, \pi_1(s2))] + [R(s3) - C(s3, \pi_1(s3))] + [R(s4) - C(s4, \pi_1(s4))] + [R(s4) - C(s4, \pi_1(s4))] + \cdots$
  $= [0 - 100] + [0 - 1] + [0 - 100] + [100 - 0] + [100 - 0] + \cdots = \infty$
$V(h_2 \mid \pi_1) = [0 - 100] + [0 - 1] + [-100 - 0] + [-100 - 0] + [-100 - 0] + \cdots = -\infty$

24 Discounted utility
In a long history of states, distant rewards and costs have less influence on the current decision.
Often we need to use a discount factor $\gamma$, $0 \le \gamma \le 1$.
Discounted utility of a history:
  $V(h \mid \pi) = \sum_{i \ge 0} \gamma^i [R(s_i) - C(s_i, \pi(s_i))]$
  Convergence is guaranteed if $0 \le \gamma < 1$.
Expected utility of a policy:
  $E(\pi) = \sum_h P(h \mid \pi) \, V(h \mid \pi)$

25 Discounted and expected utility
$\pi_1$ = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
$h_1 = \langle s1, s2, s3, s4, s4, \ldots \rangle$
$h_2 = \langle s1, s2, s5, s5, \ldots \rangle$
With $\gamma = 0.9$:
$V(h_1 \mid \pi_1) = .9^0 [0 - 100] + .9^1 [0 - 1] + .9^2 [0 - 100] + .9^3 [100 - 0] + .9^4 [100 - 0] + \cdots = 547.1$
$V(h_2 \mid \pi_1) = .9^0 [0 - 100] + .9^1 [0 - 1] + .9^2 [-100 - 0] + .9^3 [-100 - 0] + \cdots = -910.9$
$E(\pi_1) = 0.8 \, V(h_1 \mid \pi_1) + 0.2 \, V(h_2 \mid \pi_1) = 0.8(547.1) + 0.2(-910.9) = 255.5$
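The discounted sums in this example can be checked numerically. The sketch below assumes each history consists of a short prefix of per-step terms $R(s_i) - C(s_i, \pi(s_i))$ followed by one term repeated forever, so the infinite tail is a geometric series. The helper name `discounted_utility` is hypothetical; the per-step terms are the bracketed values from the example.

```python
def discounted_utility(prefix_terms, tail_term, gamma):
    """Sum of gamma^i * term_i, where tail_term repeats forever after the prefix."""
    total = sum(gamma ** i * term for i, term in enumerate(prefix_terms))
    # Geometric tail: tail_term * (gamma^n + gamma^(n+1) + ...) with n = len(prefix)
    total += tail_term * gamma ** len(prefix_terms) / (1.0 - gamma)
    return total


gamma = 0.9
# h1 = <s1, s2, s3, s4, s4, ...>: three moves, then reward 100 at s4 forever
v_h1 = discounted_utility([0 - 100, 0 - 1, 0 - 100], 100 - 0, gamma)
# h2 = <s1, s2, s5, s5, ...>: two moves, then reward -100 at s5 forever
v_h2 = discounted_utility([0 - 100, 0 - 1], -100 - 0, gamma)
# Expected utility of pi_1, weighting each history by its probability
expected = 0.8 * v_h1 + 0.2 * v_h2
print(round(v_h1, 1), round(v_h2, 1), round(expected, 1))
```

The closed-form tail requires $\gamma < 1$, which is the same condition that guarantees convergence of the discounted utility.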

26 Planning as optimization
Consider the special case with a start state $s_0$ in which all rewards are 0.
Consider cost rather than utility: the negative of what we had before.
  The equations are slightly simpler and can be generalized to the case of nonzero rewards.
Discounted cost of a history $h$ and policy $\pi$:
  $C(h \mid \pi) = \sum_{i \ge 0} \gamma^i C(s_i, \pi(s_i))$
Expected cost of a policy $\pi$:
  $E(\pi) = \sum_h P(h \mid \pi) \, C(h \mid \pi)$
A policy $\pi$ is optimal if for every $\pi'$, $E(\pi) \le E(\pi')$.
A policy $\pi$ is everywhere optimal if for every $s$ and every $\pi'$, $E_\pi(s) \le E_{\pi'}(s)$, where $E_\pi(s)$ is the expected cost if we start at $s$ rather than $s_0$.

27 Bellman's theorem
If $\pi$ is any policy, then for every $s$,
  $E_\pi(s) = C(s, \pi(s)) + \gamma \sum_{s' \in S} P_{\pi(s)}(s' \mid s) \, E_\pi(s')$
[Figure: state s, action π(s), and successor states s_1, s_2, ..., s_n]
Let $Q_\pi(s, a)$ be the expected cost in a state $s$ if we start by executing the action $a$ and use the policy $\pi$ from then onward:
  $Q_\pi(s, a) = C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, E_\pi(s')$
Bellman's theorem: Suppose $\pi^*$ is everywhere optimal. Then for every $s$,
  $E_{\pi^*}(s) = \min_{a \in A(s)} Q_{\pi^*}(s, a)$

28 Intuition behind Bellman's theorem
Bellman's theorem: Suppose $\pi^*$ is everywhere optimal. Then for every $s$,
  $E_{\pi^*}(s) = \min_{a \in A(s)} Q_{\pi^*}(s, a)$
If we use $\pi^*$ everywhere else, then the optimal actions at $s$ are $\arg\min_{a \in A(s)} Q_{\pi^*}(s, a)$.
  If $\pi^*$ is optimal, then at each state it will pick one of those.
  Otherwise we can construct a better policy by using an action in $\arg\min_{a \in A(s)} Q_{\pi^*}(s, a)$ instead of the action that $\pi^*$ uses.
From Bellman's theorem it follows that for all $s$,
  $E_{\pi^*}(s) = \min_{a \in A(s)} \{ C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, E_{\pi^*}(s') \}$

29 Finite horizon decision-making
First look at policy optimization over a finite horizon.
  Assumes the agent has $n$ time steps to live
  $\gamma = 1$
To act optimally, should we use a stationary or a nonstationary policy?
  Put another way: if you had only one week to live, would you act the same way as if you had fifty years to live?

30 Finite horizon problems
Value (utility) depends on stages-to-go, hence so should the policy: nonstationary $\pi(s, k)$.
$V_\pi^k(s)$ is the $k$-stage-to-go value function for $\pi$:
  Expected total reward/cost after executing $\pi$ for $k$ time steps
  $V_\pi^k(s) = \sum_{t=0}^{k} C(s_t, \pi(s_t)) \, P(h_t \mid \pi)$
  where $C(s_t, \pi(s_t))$ denotes the cost received at stage $t$ and $P(h_t \mid \pi)$ is the probability of the history seen up to stage $t$, given the policy $\pi$

31 Computing the finite-horizon value
The Markov property facilitates dynamic programming.
  Use it to compute $V_\pi^k(s)$
  Work backward from the final stage to determine the optimal policy
$V_\pi^0(s) = C(s, \pi(s))$
$V_\pi^k(s) = C(s, \pi(s)) + \sum_{s'} P_{\pi(s,k)}(s' \mid s) \, V_\pi^{k-1}(s')$
  The first term is the immediate cost; the second is the expected future cost with $k-1$ stages to go.
[Figure: backup from $V^k$ through action $\pi(s,k)$, with transition probabilities 0.7 and 0.3, to $V^{k-1}$]

33 Bellman backup
Use Bellman's theorem to incrementally compute the utility value.
Back up the values from $V_t$ to $V_{t+1}$ using dynamic programming and choosing the minimum path:
  $V_{t+1}(s) = \min \{ C(s, a_1) + 0.7 V_t(s1) + 0.3 V_t(s4), \; C(s, a_2) + 0.4 V_t(s2) + 0.6 V_t(s3) \}$
Uses a procedure called value iteration.
[Figure: state s with action $a_1$ leading to s1 (0.7) and s4 (0.3), and action $a_2$ leading to s2 (0.4) and s3 (0.6)]

34 General case value iteration algorithm
1. Start with an arbitrary cost $E_0(s)$ for each $s$ and a small $\varepsilon > 0$
2. For $i = 1, 2, \ldots$
   a) For every $s$ in $S$ and $a$ in $A$,
      i. $Q_i(s, a) := C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, E_{i-1}(s')$
      ii. $E_i(s) = \min_{a \in A(s)} Q_i(s, a)$
      iii. $\pi_i(s) = \arg\min_{a \in A(s)} Q_i(s, a)$
   b) If $\max_{s \in S} |E_i(s) - E_{i-1}(s)| < \varepsilon$ then exit
$\pi_i$ converges to $\pi^*$ after finitely many iterations, but how can we tell that it has converged?
  In general, $E_i \ne E_{\pi_i}$
  When $\pi_i$ doesn't change, $E_i$ may still change
  The changes in $E_i$ may make $\pi_i$ start changing again

35 General case value iteration algorithm
1. Start with an arbitrary cost $E_0(s)$ for each $s$ and a small $\varepsilon > 0$
2. For $i = 1, 2, \ldots$
   a) For every $s$ in $S$ do
      i. For each $a$ in $A$ do $Q(s, a) := C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, E_{i-1}(s')$
      ii. $E_i(s) = \min_{a \in A(s)} Q(s, a)$
      iii. $\pi_i(s) = \arg\min_{a \in A(s)} Q(s, a)$
   b) If $\max_{s \in S} |E_i(s) - E_{i-1}(s)| < \varepsilon$ then exit
If $E_i$ changes by less than $\varepsilon$ and if $\varepsilon$ is small enough, then $\pi_i$ will no longer change.
  In this case $\pi_i$ has converged to $\pi^*$
How small is small enough?
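The algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the nested-dictionary layout (`P[s][a]` maps successors to probabilities, `C[s][a]` is the cost), the three-state toy problem, and the `max_iters` safety cap are not from the lecture.

```python
def value_iteration(P, C, gamma=0.9, eps=1e-6, max_iters=10_000):
    """Return (E, policy) after the values change by less than eps."""
    E = {s: 0.0 for s in P}          # arbitrary initial cost E_0(s) = 0
    policy = {}
    for _ in range(max_iters):       # iteration cap is an added safeguard
        E_new = {}
        for s in P:
            # Q(s,a) = C(s,a) + gamma * sum_s' P_a(s'|s) E_{i-1}(s')
            q = {a: C[s][a] + gamma * sum(p * E[s2] for s2, p in P[s][a].items())
                 for a in P[s]}
            best = min(q, key=q.get)
            E_new[s] = q[best]       # E_i(s) = min_a Q(s,a)
            policy[s] = best         # pi_i(s) = argmin_a Q(s,a)
        delta = max(abs(E_new[s] - E[s]) for s in P)
        E = E_new
        if delta < eps:
            break
    return E, policy


# Hypothetical toy problem: from s1, "risky" reaches the goal half the time.
P = {
    "s1": {"risky": {"goal": 0.5, "s1": 0.5}, "safe": {"s2": 1.0}},
    "s2": {"go": {"goal": 1.0}},
    "goal": {"wait": {"goal": 1.0}},
}
C = {
    "s1": {"risky": 1.0, "safe": 100.0},
    "s2": {"go": 1.0},
    "goal": {"wait": 0.0},
}
E, policy = value_iteration(P, C)
```

In this toy problem the fixed point at s1 satisfies $E = 1 + 0.9 \cdot 0.5 \cdot E$, so the values should approach $1/0.55 \approx 1.818$ with "risky" as the chosen action.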

36 Value iteration in finite-horizon case
Use the Markov property and Bellman backup to avoid enumerating all possibilities.
Value iteration:
  Initialize to the final stage cost: $V_0(s) = C(s)$
  Compute $V_k(s)$, the optimal $k$-stage-to-go value function:
    $V_k(s) = \min_{a \in A(s)} \{ C(s, a) + \sum_{s' \in S} P_a(s' \mid s) \, V_{k-1}(s') \}$
  Derive the optimal $k$-stage-to-go policy:
    $\pi^*(s, k) = \arg\min_{a \in A(s)} \{ C(s, a) + \sum_{s' \in S} P_a(s' \mid s) \, V_{k-1}(s') \}$
The optimal value function is unique, but the optimal policy is not: many policies can have the same value.
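A short sketch of the finite-horizon recursion, showing how the best action can depend on the number of stages to go. The two-state problem, the final-stage costs, and the function name `finite_horizon_vi` are illustrative assumptions.

```python
def finite_horizon_vi(P, C, final_cost, horizon):
    """Return V[k][s] and policy[k][s] for k = 0..horizon stages to go."""
    V = [dict(final_cost)]   # V_0(s) = final stage cost
    policy = [None]          # no action is chosen with 0 stages to go
    for k in range(1, horizon + 1):
        Vk, pik = {}, {}
        for s in P:
            # No discounting here (gamma = 1), as in the finite-horizon setting
            q = {a: C[s][a] + sum(p * V[k - 1][s2] for s2, p in P[s][a].items())
                 for a in P[s]}
            best = min(q, key=q.get)
            Vk[s], pik[s] = q[best], best
        V.append(Vk)
        policy.append(pik)
    return V, policy


# Hypothetical problem: ending away from the goal costs 10.
P = {
    "s1": {"risky": {"goal": 0.5, "s1": 0.5}, "safe": {"goal": 1.0}},
    "goal": {"wait": {"goal": 1.0}},
}
C = {"s1": {"risky": 1.0, "safe": 3.0}, "goal": {"wait": 0.0}}
final_cost = {"s1": 10.0, "goal": 0.0}

V, policy = finite_horizon_vi(P, C, final_cost, horizon=3)
for k in range(1, 4):
    print(k, policy[k]["s1"], V[k]["s1"])
```

With one stage to go the cheap gamble is too dangerous (expected cost 6 versus 3), but with two or more stages to go a failed gamble can still be repaired, so the optimal policy is nonstationary.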

37 Finite-horizon value iteration example
[Figure: value iteration over stages $V_3, V_2, V_1, V_0$ for states s1, s2, s3, s4 with actions a1 and a2]
$V_1(s4) = \min \{ C(s4, a1) + 0.7 V_0(s1) + 0.3 V_0(s4), \; C(s4, a2) + 0.4 V_0(s2) + 0.6 V_0(s3) \}$
$\pi^*(s4, 1) = \arg\min_{a \in \{a1, a2\}} \{ C(s, a) + \sum_{s' \in S} P_a(s' \mid s) \, V_{k-1}(s') \}$ with $s = s4$ and $k = 1$

39 Complexity of finite-horizon value iteration
The optimal solution to the $(k-1)$-stage problem can be used without modification as part of the optimal solution to the $k$-stage problem (dynamic programming).
Because of the finite horizon, the policy is nonstationary.
What is the computational complexity?
  $T$ iterations
  At each iteration, each of the $n$ states computes an expectation for each of the $|A|$ actions
  Each expectation takes $O(n)$ time
Total time complexity: $O(T |A| n^2)$
  Polynomial in the number of states. Is this good?

41 Discounted infinite-horizon MDPs
Defining value as total reward is problematic with infinite horizons.
  Many or all policies have infinite expected reward
Use our discounted utility to discount the future reward at each time step.
  Discount factor $\gamma$, $0 \le \gamma \le 1$
  Can restrict attention to stationary policies
Discounted cost of a history $h$ and policy $\pi$:
  $C(h \mid \pi) = \sum_{i \ge 0} \gamma^i C(s_i, \pi(s_i))$
Expected cost of a policy $\pi$:
  $E(\pi) = \sum_h P(h \mid \pi) \, C(h \mid \pi)$
A policy $\pi$ is optimal if for every $\pi'$, $E(\pi) \le E(\pi')$.

42 Value iteration in infinite horizon case
Use value iteration and Bellman backup to compute optimal policies, just like in the finite-horizon case.
Now just include the discount factor in the computation:
  Initialize with an arbitrary cost $E_0(s)$ for each state
  Compute $E_i(s)$, the approximation of the optimal value function:
    $E_i(s) = \min_{a \in A(s)} \{ C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, E_{i-1}(s') \}$
  Will converge to the optimal value function as $i$ gets large

43 Infinite horizon value iteration example
Let $a_{ij}$ be the action that moves from $s_i$ to $s_j$, e.g., $a_{11}$ = wait and $a_{12}$ = move(r1,l1,l2).
Start with $E_0(s) = 0$ for all $s$, and $\varepsilon = 1$.
[Figure: state diagram of the robot example with action costs and Start at s1]

44 Infinite horizon value iteration example
For each $s$ and $a$ compute $Q(s, a) := C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, E_{i-1}(s')$.
Since $E_0(s') = 0$ for all $s'$, the discounted term vanishes in the first iteration and each $Q(s, a)$ equals $C(s, a)$:
  $Q(s1, a_{11}) = 1$, $Q(s1, a_{12}) = 100$, $Q(s1, a_{14}) = 1$
  $Q(s2, a_{21}) = 100$, $Q(s2, a_{22}) = 1$, $Q(s2, a_{23}) = 1$
  $Q(s3, a_{32}) = 1$, $Q(s3, a_{34}) = 100$
  $Q(s4, a_{41}) = 1$, $Q(s4, a_{43}) = 1$, $Q(s4, a_{44}) = 0$, $Q(s4, a_{45}) = 100$
  $Q(s5, a_{52}) = 1$, $Q(s5, a_{54}) = 100$, $Q(s5, a_{55}) = 100$

45 Infinite horizon value iteration example
For each $s$ compute $E_1(s)$ and $\pi_1(s)$:
  $E_1(s1) = 1$, $\pi_1(s1) = a_{11}$ = wait
  $E_1(s2) = 1$, $\pi_1(s2) = a_{22}$ = wait
  $E_1(s3) = 1$, $\pi_1(s3) = a_{32}$ = move(r1,l3,l2)
  $E_1(s4) = 0$, $\pi_1(s4) = a_{44}$ = wait
  $E_1(s5) = 1$, $\pi_1(s5) = a_{52}$ = move(r1,l5,l2)
What other actions could we have chosen?
Is $\varepsilon$ small enough?

47 Deciding how to act using MDPs
Given an $E_i$ from value iteration that closely approximates $E_{\pi^*}$, what should we use as our policy?
Use the greedy policy:
  $\pi(s) = \text{greedy}[E_i](s) = \arg\min_{a \in A(s)} Q(s, a)$, where $Q(s, a) = C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, E_i(s')$
Note that the value of the greedy policy may not be equal to $E_i$.
Let $E_G$ be the value of the greedy policy. How close is $E_G$ to $E_{\pi^*}$?
  Greedy is not too far from optimal if $E_i$ is close to $E_{\pi^*}$.
  In particular, if $E_i$ is within $\varepsilon$ of $E_{\pi^*}$, then $E_G$ is within $2\varepsilon\gamma / (1 - \gamma)$ of $E_{\pi^*}$.
  Furthermore, there exists a finite $\varepsilon$ such that the greedy policy is optimal, i.e., even if the value estimate is off, greedy is optimal once it is close enough.
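Greedy policy extraction and the stated loss bound can be written directly. In this sketch the value estimate `E_approx`, the toy problem, and the names `greedy_policy` and `greedy_loss_bound` are assumptions made for illustration.

```python
def greedy_policy(E, P, C, gamma):
    """pi(s) = argmin_a { C(s,a) + gamma * sum_s' P_a(s'|s) E(s') }."""
    policy = {}
    for s in P:
        q = {a: C[s][a] + gamma * sum(p * E[s2] for s2, p in P[s][a].items())
             for a in P[s]}
        policy[s] = min(q, key=q.get)
    return policy


def greedy_loss_bound(eps, gamma):
    """Bound on the greedy policy's loss when E is within eps of optimal."""
    return 2 * eps * gamma / (1 - gamma)


# Hypothetical toy problem and a rough value estimate for it.
P = {
    "s1": {"risky": {"goal": 0.5, "s1": 0.5}, "safe": {"s2": 1.0}},
    "s2": {"go": {"goal": 1.0}},
    "goal": {"wait": {"goal": 1.0}},
}
C = {
    "s1": {"risky": 1.0, "safe": 100.0},
    "s2": {"go": 1.0},
    "goal": {"wait": 0.0},
}
E_approx = {"s1": 2.0, "s2": 1.0, "goal": 0.0}

pi_greedy = greedy_policy(E_approx, P, C, gamma=0.9)
bound = greedy_loss_bound(eps=0.1, gamma=0.9)
```

Note how the bound grows as $\gamma$ approaches 1: with $\gamma = 0.9$ the factor $2\gamma/(1-\gamma)$ is already 18.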

49 Policy iteration for infinite-horizon planning
Policy iteration is another way to find $\pi^*$.
Suppose there are $n$ states $s_1, \ldots, s_n$.
Start with an arbitrary initial policy $\pi_1$.
For $i = 1, 2, \ldots$
  Compute $\pi_i$'s expected costs by solving $n$ equations with $n$ unknowns ($n$ instances of the discounted expected cost equation):
    $E_{\pi_i}(s_1) = C(s_1, \pi_i(s_1)) + \gamma \sum_{k=1}^{n} P_{\pi_i(s_1)}(s_k \mid s_1) \, E_{\pi_i}(s_k)$
    $\vdots$
    $E_{\pi_i}(s_n) = C(s_n, \pi_i(s_n)) + \gamma \sum_{k=1}^{n} P_{\pi_i(s_n)}(s_k \mid s_n) \, E_{\pi_i}(s_k)$
  For every $s_j$,
    $\pi_{i+1}(s_j) = \arg\min_{a \in A} Q_{\pi_i}(s_j, a) = \arg\min_{a \in A} \left[ C(s_j, a) + \gamma \sum_{k=1}^{n} P_a(s_k \mid s_j) \, E_{\pi_i}(s_k) \right]$
  If $\pi_{i+1} = \pi_i$ then exit
Converges in a finite number of iterations.
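The two phases of policy iteration (solve the linear system, then improve greedily) can be sketched in pure Python. The small Gaussian-elimination helper `solve`, the tie-breaking rule that keeps the current action when it is already optimal, and the toy problem are all assumptions added for this illustration.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(policy, P, C, gamma):
    """Solve E(s) = C(s, pi(s)) + gamma * sum_s' P(s'|s) E(s') for all s."""
    states = list(P)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for s in states:
        a = policy[s]
        A[idx[s]][idx[s]] += 1.0
        for s2, p in P[s][a].items():
            A[idx[s]][idx[s2]] -= gamma * p
        b[idx[s]] = C[s][a]
    return dict(zip(states, solve(A, b)))


def policy_iteration(P, C, gamma, policy):
    while True:
        E = evaluate(policy, P, C, gamma)
        new_policy = {}
        for s in P:
            q = {a: C[s][a] + gamma * sum(p * E[s2] for s2, p in P[s][a].items())
                 for a in P[s]}
            best = min(q, key=q.get)
            # Keep the current action on ties so the loop cannot oscillate.
            new_policy[s] = policy[s] if q[policy[s]] <= q[best] + 1e-12 else best
        if new_policy == policy:
            return policy, E
        policy = new_policy


# Hypothetical toy problem, started from a deliberately poor policy.
P = {
    "s1": {"risky": {"goal": 0.5, "s1": 0.5}, "safe": {"s2": 1.0}},
    "s2": {"go": {"goal": 1.0}},
    "goal": {"wait": {"goal": 1.0}},
}
C = {
    "s1": {"risky": 1.0, "safe": 100.0},
    "s2": {"go": 1.0},
    "goal": {"wait": 0.0},
}
start = {"s1": "safe", "s2": "go", "goal": "wait"}
pi_star, E_star = policy_iteration(P, C, 0.9, start)
```

Here one improvement step switches s1 from "safe" to "risky", and the next evaluation confirms that no further change helps, which illustrates why policy iteration often needs few iterations even though each one solves a full linear system.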

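50 Policy iteration example
Assume discount factor $\gamma = 0.9$.
s5 is undesirable: $C(s5, \text{wait}) = 100$
Incentive to leave non-goal states: $C(s1, \text{wait}) = 1$, $C(s2, \text{wait}) = 1$
[Figure: state diagram of the robot example with action costs and Start at s1]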

51 Policy iteration example
Start with an arbitrary policy:
$\pi_1$ = {(s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait)}
Compute expected costs across all states.


54 Policy iteration example
Compute expected costs across all states.
At each state $s$, let $\pi_2(s) = \arg\min_{a \in A(s)} Q_{\pi_1}(s, a)$:
$\pi_2$ = {(s1, move(r1,l1,l4)), (s2, move(r1,l2,l1)), (s3, move(r1,l3,l4)), (s4, wait), (s5, move(r1,l5,l4))}

55 Policy iteration vs. value iteration
Policy iteration computes an entire policy in each iteration, and computes values based on that policy.
  More work per iteration: needs to solve a set of simultaneous equations
  Usually converges in a smaller number of iterations
Value iteration computes new values in each iteration, and chooses a policy based on those values.
  In general, the values are not the values that one would get from the chosen policy or any other policy
  Less work per iteration
  Usually takes more iterations to converge

56 Policy iteration vs. value iteration
For both, the number of iterations is polynomial in the number of states.
  But the number of states is usually quite large
  Need to examine the entire state space in each iteration
Thus, these algorithms can take huge amounts of time and space.
Use real-time dynamic programming and heuristics to improve efficiency.

57 Real-time dynamic programming basics
Explicitly specify goal states.
  If $s$ is a goal, then actions at $s$ have no cost and produce no change
For each state $s$, maintain a value $V(s)$ that gets updated as the algorithm proceeds.
  Initially $V(s) = h(s)$, where $h$ is a heuristic function
Greedy policy:
  $\pi(s) = \arg\min_{a \in A(s)} Q(s, a)$, where $Q(s, a) = C(s, a) + \gamma \sum_{s' \in S} P_a(s' \mid s) \, V(s')$

58 RTDP algorithm
procedure RTDP(s)
  1. Loop until termination condition
     a) RTDP-trial(s)
procedure RTDP-trial(s)
  1. While s is not a goal state
     a) a := arg min_{a ∈ A(s)} Q(s,a)
     b) V(s) := Q(s,a)
     c) Randomly pick s' with probability P_a(s' | s)
     d) s := s'
RTDP-trial is a forward search.
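The two procedures translate almost line for line into Python. This is a sketch under stated assumptions: the termination condition is a fixed number of trials, a step cap guards against endless trials, and the small problem below is a hypothetical stand-in for the robot domain in which one action from s1 reaches the goal s4 with probability ½ at cost 1.

```python
import random


def q_value(s, a, V, P, C, gamma):
    """Q(s,a) = C(s,a) + gamma * sum_s' P_a(s'|s) V(s')."""
    return C[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())


def rtdp_trial(s, V, P, C, goals, gamma, rng, max_steps=1000):
    """One forward trial: greedy action, value update, sampled successor."""
    for _ in range(max_steps):           # step cap is an added safeguard
        if s in goals:
            break
        a = min(P[s], key=lambda act: q_value(s, act, V, P, C, gamma))
        V[s] = q_value(s, a, V, P, C, gamma)
        successors = list(P[s][a])
        weights = [P[s][a][s2] for s2 in successors]
        s = rng.choices(successors, weights=weights)[0]


def rtdp(s0, P, C, goals, h, gamma=0.9, trials=200, seed=0):
    rng = random.Random(seed)
    V = {s: h(s) for s in P}             # initially V(s) = h(s)
    for _ in range(trials):              # assumed termination: fixed trial count
        rtdp_trial(s0, V, P, C, goals, gamma, rng)
    return V


# Hypothetical problem with goal s4.
P = {
    "s1": {"a14": {"s1": 0.5, "s4": 0.5}, "a12": {"s2": 1.0}},
    "s2": {"a21": {"s1": 1.0}},
    "s4": {"stay": {"s4": 1.0}},
}
C = {
    "s1": {"a14": 1.0, "a12": 100.0},
    "s2": {"a21": 100.0},
    "s4": {"stay": 0.0},
}
V = rtdp("s1", P, C, goals={"s4"}, h=lambda s: 0.0)
```

With $h(s) = 0$ the first two updates of V(s1) in this problem are $1 + .9(½ \cdot 0 + ½ \cdot 0) = 1$ and $1 + .9(½ \cdot 1 + ½ \cdot 0) = 1.45$. State s2 is never visited, so its value is never updated, which is the sense in which RTDP avoids examining the whole state space.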

60 Example of RTDP
Run RTDP on the robot example with $\gamma = 0.9$ and $h(s) = 0$ for all $s$, so initially $V = 0$ at every state.
First pass through the loop of RTDP-trial, at the start state:
  Compute Q for each applicable action: Q = 100, Q = 1 + .9(½*0 + ½*0) = 1, and Q = 100
  a := the action with Q = 1
  V(start) := 1
  Randomly pick s' with probability $P_a(s' \mid s)$; here the robot is again at the start state
Second pass, again at the start state:
  Compute Q for each applicable action: Q = 100, Q = 1 + .9(½*1 + ½*0) = 1.45, and Q = 100
  a := the action with Q = 1.45
  V(start) := 1.45
  Randomly pick s' with probability $P_a(s' \mid s)$ and continue until a goal state is reached
[Figure: state diagram of the robot example showing the current value V at each visited state]

72 Performance of RTDP
In practice, it can solve much larger problems than policy iteration and value iteration.
It won't always find an optimal solution, and it won't always terminate.
  If $h$ does not overestimate (i.e., $h$ is an admissible heuristic), and if a goal is reachable (with positive probability) from every state, then it will terminate.
  If, in addition to the above, there is a positive-probability path between every pair of states, then it will find an optimal solution.
