AUTONOMOUS SYSTEMS. Task Planning. Pedro U. Lima, M. Isabel Ribeiro. Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal


1 AUTONOMOUS SYSTEMS Task Planning. Pedro U. Lima, M. Isabel Ribeiro. Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal. March 2007

2 Outline 1. Planning Problem 2. Logic 3. Logic-Based Planning: Situation Calculus, STRIPS 4. Plan Representation and Modeling: Petri Net Task Models 5. Plan Analysis 6. Planning Under Uncertainty 7. Markov Decision Processes (MDP) 8. Dynamic Programming Solution of MDPs 9. Reinforcement Learning Solution of MDPs

3 Planning Planning consists of determining the action sequence that enables reaching the goal(s) of an agent. Robot Task Planning consists of determining the appropriate set of actions to move a robot from the current world state to a world state that satisfies its preferences.

4 Logic
Logic can be seen as a language to represent the knowledge about the world and about a particular problem to be solved.
Syntactic System:
- Alphabet: the set of accepted symbols
- Formation rules: the set of rules establishing how symbols can be aggregated so as to build formulas/sentences
Alphabet + formation rules = LANGUAGE
- Inference rules: the set of rules that establish how to derive formulas from other formulas

5 Logic
Semantic System: assigns a meaning to the language formulas.
- World (semantics): facts
- Language (syntax): formulas

6 Logic
Syntactic System vs Semantic System
- Language rules: g + r + e + e + n → green; the semantics associates a color to the word green.
- Arithmetic rules: x, y are expressions representing numbers, and x > y is a formula over numbers; the corresponding fact is true when the number represented by x is greater than the number represented by y.

7 Logic
Typically, one deals only with the world issues relevant for the problem, through a conceptualization of reality:
- Objects and their relations are defined
- Functions: given a set of objects, a function establishes which object is related to the object(s) in the set and how, e.g., left_room(kitchen)
- Relations: given a set of objects, a relation establishes whether that set is related in a certain way, e.g., on(laptop, table)

8 Logic
The concept of interpretation establishes the link between the language elements and the elements of the conceptualization of reality (objects, functions and relations).
Given a formula written in the defined language, its interpretation is designated as a proposition.
A proposition is true iff it correctly describes the world, based on the adopted conceptualization of reality.
A formula is satisfied iff there is an interpretation that associates it to a true proposition.

9 Logic
A fact is a true proposition for a given (conceptualized) world state. The initial known facts compose the initial knowledge base.
Inference is the process of obtaining new propositions (conclusions) from the knowledge base.
To ensure that a reached conclusion is satisfied by the adopted interpretation, only a conclusion satisfied for all the interpretations that satisfy the starting propositions (premises) is accepted. This way, we guarantee that, should the premises be satisfied, so is the conclusion, irrespective of the interpretation.
Example:
Premises: IF on(a,b) THEN above(a,b); on(a,b)
Conclusion: above(a,b)

10 Logic
Entailment (semantic perspective of inference) means that the truth of a given fact is entailed by the knowledge base or one of its subsets:
KB ⊨ α or Γ ⊨ α
KB or Γ are the premises and α is the conclusion.
Derivation (syntactic perspective of inference) is the process of proving new formulas from a set of existing formulas:
F ⊢ f or Γ ⊢ α
α (or f) denotes the formula proved from Γ (or F).

11 Logic
An inference mechanism is sound iff any formula proven/derived from a set of formulas, using that mechanism, is entailed by the set:
IF Γ ⊢ α (syntactic perspective) THEN Γ ⊨ α (semantic perspective)
An inference mechanism is complete iff, for any proposition entailed by a premise set, the formula denoting that proposition is provable/derivable from the premise set, using that mechanism:
IF Γ ⊨ α THEN Γ ⊢ α

12 Logic
Propositional Logic: facts.
Predicate Logic: objects, functions and relations; variables; quantifiers.

13 Situation Calculus
Logic handles the truth of propositions, not action execution: logic cannot tell which action should be executed; at most it can suggest the possible actions. Time and change are not adequately handled by basic logic (propositional, predicate).
Idea:
- the world state is represented by a proposition set
- the set is changed according to received perceptions and executed actions
- the world evolution is described by diachronic rules, which express how the world changes (representation of change)
Situation Calculus attempts to solve the problems associated with representing and reasoning about change. It is based on predicate logic and describes the world as a sequence of situations, each of which represents a world state.

14 Situation Calculus
One situation is generated from another situation by executing an action.
An argument is added to each property (represented by a predicate) that may change, denoting the situation where the property is satisfied.
Ex: localization(agent, (1,1), S0), localization(agent, (1,2), S1)
To represent passing from one situation to another, the following function is used:
Result(action, situation) : A × Σ → Σ
Ex: Result(go_ahead, S0) = S1
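
A toy illustration of the Result function: since every situation is reached from S0 by a finite action sequence, a situation can be encoded as that sequence, and Result simply appends the new action. This sketch and its names (result, go_ahead) are illustrative assumptions, not material from the original slides.

```python
# Toy sketch: situations encoded as the sequence of actions executed since S0,
# so Result(action, situation) just appends the action (illustrative assumption).
S0 = ()

def result(action, situation):
    """Result: A x Sigma -> Sigma, the situation obtained after executing the action."""
    return situation + (action,)

S1 = result("go_ahead", S0)
print(S1)                        # ('go_ahead',)
print(result("go_ahead", S1))    # ('go_ahead', 'go_ahead')
```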

15 Situation Calculus
Effect Axioms:
pre-conditions (to execute the action) → predicate (whose logical value changes after the action is executed)
State the action effects to describe the change(s) due to the action, e.g.:
∀x ∀s Present(x, s) ∧ Portable(x) → Hold(x, Result(pickup, s))
∀x ∀s ¬Hold(x, Result(release, s))

16 Situation Calculus
Frame Axioms:
predicate (logical value in current situation) ∧ conditions (for no change) → predicate (in the situation following the action)
One needs to state what does not change due to the action execution, e.g.:
∀a ∀x ∀s Hold(x, s) ∧ (a ≠ release) → Hold(x, Result(a, s))
∀a ∀x ∀s ¬Hold(x, s) ∧ (a ≠ pickup ∨ ¬(Present(x, s) ∧ Portable(x))) → ¬Hold(x, Result(a, s))

17 Situation Calculus
Successor State Axioms merge effect and frame axioms:
predicate true in the next situation ↔ [one action makes it true ∨ (it was true in the previous situation ∧ no action made it false)]
e.g.:
∀a ∀x ∀s Hold(x, Result(a, s)) ↔ [(a = pickup ∧ Present(x, s) ∧ Portable(x)) ∨ (Hold(x, s) ∧ a ≠ release)]
∀a ∀x ∀s ¬Hold(x, Result(a, s)) ↔ [(a = release) ∨ (¬Hold(x, s) ∧ (a ≠ pickup ∨ ¬(Present(x, s) ∧ Portable(x))))]

18 Situation Calculus Example (Blocks World)
Initial Situation: a on b on c; Final Situation: c on b on a. Action Sequence?
Predicates: On(x, y, s), ClearTop(x, s), Block(x). Objects: A, B, C, M (blocks and table). Action: PutOn(x, y).
Effect Axioms:
∀x ∀y ∀s Block(x) ∧ (Block(y) ∨ y = M) ∧ ClearTop(x, s) ∧ ClearTop(y, s) → On(x, y, Result(PutOn(x,y), s))
∀x ∀y ∀w ∀s Block(x) ∧ (Block(y) ∨ y = M) ∧ ClearTop(x, s) ∧ ClearTop(y, s) ∧ On(x, w, s) → ClearTop(w, Result(PutOn(x,y), s))

19 Situation Calculus Example (Blocks World)
Initial Situation: a on b on c; Final Situation: c on b on a. Action Sequence?
Predicates: On(x, y, s), ClearTop(x, s), Block(x). Objects: A, B, C, M (blocks and table). Action: PutOn(x, y).
Frame Axioms:
∀a ∀x ∀y ∀z ∀s On(x, y, s) ∧ (a ≠ PutOn(x, z)) → On(x, y, Result(a, s))
∀a ∀x ∀y ∀w ∀s ClearTop(y, s) ∧ (a ≠ PutOn(x, w)) → ClearTop(y, Result(a, s))

20 Situation Calculus Example (Blocks World)
Initial Situation: a on b on c; Final Situation: c on b on a. Action Sequence?
Predicates: On(x, y, s), ClearTop(x, s), Block(x). Objects: A, B, C, M (blocks and table). Action: PutOn(x, y).
Resulting Successor State Axioms:
∀x ∀y ∀z ∀s On(x, y, Result(a, s)) ↔ [(a = PutOn(x,y) ∧ On(x, z, s) ∧ ClearTop(x, s) ∧ ClearTop(y, s) ∧ Block(x) ∧ (Block(y) ∨ y = M)) ∨ (a ≠ PutOn(x, z) ∧ On(x, y, s))]
∀x ∀y ∀z ∀s ClearTop(z, Result(a, s)) ↔ [(a = PutOn(x,y) ∧ On(x, z, s) ∧ ClearTop(x, s) ∧ ClearTop(y, s) ∧ Block(x) ∧ (Block(y) ∨ y = M)) ∨ (a ≠ PutOn(x, z) ∧ ClearTop(z, s))]

21 Situation Calculus Example (Blocks World)
Initial Situation: a on b on c; Final Situation: c on b on a. Action Sequence?
Predicates: On(x, y, s), ClearTop(x, s), Block(x). Objects: A, B, C, M (blocks and table). Action: PutOn(x, y).
Initial State: Block(A) ∧ Block(B) ∧ Block(C) ∧ On(C, M, s0) ∧ On(B, C, s0) ∧ On(A, B, s0) ∧ ClearTop(A, s0)
Goal State: Block(A) ∧ Block(B) ∧ Block(C) ∧ On(A, M, s) ∧ On(B, A, s) ∧ On(C, B, s) ∧ ClearTop(C, s)

22 Complexity of the Planning Problem
The problem is intractable in the general case. Simplifying assumptions:
- the agent knows everything that is relevant for the planning problem
- the agent knows how its available actions can change the world state from one state to another
- the planning agent is in control of the world: the only state changes are the result of its deliberate actions
- the agent's preferred world states are constant during a planning episode
Based on these assumptions, a typical approach is: first formulate the plan, then execute it.

23 Extensions of the Planning Problem
The real world surrounding the robot does not meet most of the simplifying assumptions, especially in dynamic, uncertain environments.
EXTENSIONS
- conditional planning: handles uncertainty by enumerating the possible states that may arise after the execution of an action and provides alternative courses of action for each of them
- plan monitoring and repair: during plan execution, progress is monitored and, when deviations from the predicted nominal conditions occur, the plan execution halts and a revised plan is created
- continual planning: in dynamic environments, one may allow the context and/or the agent's preferences to change, and plan revision is an ongoing process rather than one triggered by failures of the nominal plan. Planning is not made in too much detail into the future, and it is interleaved with execution.

24 Basic Planning Problem Formulation
A possible formulation of the Planning problem is (LaValle, 1996):
1. A nonempty state space, X, which is a finite or countably infinite set of states.
2. For each state, x ∈ X, a finite action space, U(x).
3. A state transition function, f, which produces a state, f(x, u) ∈ X, for every x ∈ X and u ∈ U(x). The state transition equation is derived from f as x' = f(x, u).
4. An initial state, x_I ∈ X.
5. A goal set, X_G ⊆ X.

25 Basic Planning Problem Formulation
It is convenient to represent the planning problem as a directed state transition graph. The set of vertices is the state space, X. A directed edge from x ∈ X to x' ∈ X exists in the graph if there exists an action u ∈ U(x) such that x' = f(x, u). The initial state and goal set are designated as special vertices in the graph.
Based on this formulation, several problem-solving algorithms are available to find a feasible plan (i.e., one that leads from the initial state to one of the goal states, not necessarily optimal). Examples: breadth-first, depth-first, best-first, A*, ...
Algorithms to solve Discrete Optimal Planning problems also exist, typically based on Dynamic Programming. In this case, we want to find the sequence of actions that leads to the goal set and optimizes some criterion, such as distance traversed or energy spent.
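
As a minimal sketch of this graph-based view, the breadth-first search below finds a feasible (not necessarily optimal) action sequence on a toy one-dimensional world; the problem instance (states 0 to 4, actions left/right) is a made-up example, not one from the slides.

```python
from collections import deque

def breadth_first_plan(x_init, goal_set, actions, f):
    """Find a feasible action sequence from x_init to any state in goal_set,
    by breadth-first search of the state transition graph x' = f(x, u)."""
    frontier = deque([x_init])
    parent = {x_init: None}                    # state -> (previous state, action)
    while frontier:
        x = frontier.popleft()
        if x in goal_set:
            plan = []
            while parent[x] is not None:       # walk back to x_init
                x_prev, u = parent[x]
                plan.append(u)
                x = x_prev
            return list(reversed(plan))
        for u in actions(x):                   # U(x)
            x_next = f(x, u)
            if x_next not in parent:           # not visited yet
                parent[x_next] = (x, u)
                frontier.append(x_next)
    return None                                # no feasible plan exists

# Toy 1-D world (hypothetical): states 0..4, goal {4}, moves of one cell.
actions = lambda x: [u for u in ("left", "right")
                     if 0 <= x + (1 if u == "right" else -1) <= 4]
f = lambda x, u: x + (1 if u == "right" else -1)
print(breadth_first_plan(0, {4}, actions, f))  # ['right', 'right', 'right', 'right']
```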

26 Logic-Based Planning
ADVANTAGES:
- builds compact representations for discrete planning problems, when their regularity allows such compression
- convenient for producing output that logically explains the steps involved to arrive at some goal
DISADVANTAGES:
- difficult to generalize so that concepts such as modeling uncertainty, unpredictability, sensing errors, and game theory can be incorporated into planning
It is possible to convert the logic-based formulation into the graph-based formulation, e.g., the set of literals may be encoded as a binary string by imposing a linear ordering on the instances and predicates, and using 1 for true and 0 for false. This way, even optimal solutions can be found. However, the problem dimension may become intractable, even for a small number of predicates and instances: e.g., for a constant number k of arguments per predicate, the state space dimension is 2^(|P|·|I|^k), where |P| is the number of predicates and |I| the number of instances per predicate argument.

27 Logic-Based Planning
A STRIPS-like Planning formulation is (LaValle, 1996):
1. A nonempty set, I, of instances.
2. A nonempty set, P, of predicates, which are binary-valued (partial) functions of one or more instances. Each application of a predicate to a specific set of instances is called a positive literal if the predicate is true or a negative literal if it is false.
3. A nonempty set, O, of operators, each of which has: 1) preconditions, a set of positive and negative literals that must hold for the operator to apply, and 2) effects, a set of positive and negative literals that are the result of applying the operator.
4. An initial set, S, which is expressed as a set of positive literals. All literals not appearing in S are assumed to be negative.
5. A goal set, G, which is expressed as a set of both positive and negative literals.
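
The formulation above maps directly onto data structures. The sketch below assumes states are represented as sets of positive literals (anything absent is a negative literal) and uses made-up literal names; it is an illustrative sketch, not code from the slides.

```python
# Minimal sketch of a STRIPS-like operator over states represented as sets of
# positive literals (closed-world assumption); names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:
    name: str
    pre_pos: frozenset    # positive literals that must hold
    pre_neg: frozenset    # literals that must NOT hold
    add: frozenset        # positive effects
    delete: frozenset     # negative effects

def applicable(state: frozenset, op: Operator) -> bool:
    # all positive preconditions present, no negative precondition present
    return op.pre_pos <= state and not (op.pre_neg & state)

def apply_op(state: frozenset, op: Operator) -> frozenset:
    # effects: remove the delete list, then add the add list
    return (state - op.delete) | op.add

# Hypothetical pickup operator in the spirit of the earlier situation-calculus example.
pickup = Operator("pickup(x)",
                  frozenset({"present(x)", "portable(x)"}), frozenset({"hold(x)"}),
                  frozenset({"hold(x)"}), frozenset())
s = frozenset({"present(x)", "portable(x)"})
print(applicable(s, pickup), apply_op(s, pickup))
```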

28 Logic-Based Planning
STRIPS (Stanford Research Institute Problem Solver) (Fikes, Nilsson, 1971)
Example: a mobile robot should move a box from room S3 to room S2. Rooms S1 and S2 are connected by door P1, rooms S2 and S3 by door P2; the robot starts in S1 and the box is in S3.
World Model (KB):
inroom(robot, room_s1)
inroom(box, room_s3)
connects(door_p1, room_s1, room_s2)
connects(door_p2, room_s2, room_s3)
Goal: inroom(box, room_s2)
Plan (Action Sequence): move(robot, room_s1, room_s3), search(box), push(box, room_s3, room_s2, door_p2)

29 Logic-Based Planning
STRIPS (Stanford Research Institute Problem Solver) (Fikes, Nilsson, 1971)
Tasks are specified as well-formed formulas (wff) of the predicate calculus. The planning system attempts to find an action sequence that modifies the world model so as to make the wff TRUE. To generate a plan, the effect of each action is modeled: an operator (an action over the world model) maps world model Si into world model Si+1 by adding clauses to and removing clauses from the clause set, provided the operator's pre-conditions hold. The basic loop is (see the sketch below):
1. Is the goal clause in the current world model? YES: success. NO: go to 2.
2. Search the operator list for one whose pre-conditions are satisfied and that, when applied to the current world model, produces a new world model where the goal is closer to being satisfied.
3. Go to 1.
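
A minimal sketch of this loop, using plain breadth-first search instead of STRIPS' means-ends heuristic, applied to the room/box example of the previous slide; the operator instances and literal spellings below are assumptions made for illustration.

```python
from collections import deque

# Illustrative operators: preconditions, add list and delete list as literal sets.
operators = {
    "move(robot, s1, s2)": {"pre": {"inroom(robot,s1)", "connects(p1,s1,s2)"},
                            "add": {"inroom(robot,s2)"}, "del": {"inroom(robot,s1)"}},
    "move(robot, s2, s3)": {"pre": {"inroom(robot,s2)", "connects(p2,s2,s3)"},
                            "add": {"inroom(robot,s3)"}, "del": {"inroom(robot,s2)"}},
    "push(box, s3, s2)":   {"pre": {"inroom(robot,s3)", "inroom(box,s3)", "connects(p2,s2,s3)"},
                            "add": {"inroom(box,s2)", "inroom(robot,s2)"},
                            "del": {"inroom(box,s3)", "inroom(robot,s3)"}},
}
initial = frozenset({"inroom(robot,s1)", "inroom(box,s3)",
                     "connects(p1,s1,s2)", "connects(p2,s2,s3)"})
goal = {"inroom(box,s2)"}

def strips_plan(initial, goal, operators):
    """Search over world models: apply any operator whose preconditions hold,
    stop when the goal clauses are all in the current world model."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                       # 1. goal in current world model?
            return plan
        for name, op in operators.items():      # 2. operators with satisfied pre-conditions
            if op["pre"] <= state:
                nxt = frozenset((state - op["del"]) | op["add"])
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None

print(strips_plan(initial, goal, operators))
# ['move(robot, s1, s2)', 'move(robot, s2, s3)', 'push(box, s3, s2)']
```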

30 Logic-Based Planning
STRIPS and Situation Calculus
STRIPS:
OPERATOR move(robot, room_s1, room_s2)
Pre-conditions: inroom(robot, room_s1), connects(door_p1, room_s1, room_s2)
Effects: Add: inroom(robot, room_s2); Delete: inroom(robot, room_s1)
Situation Calculus:
∀a ∀x ∀s room(s2) → { inroom(robot, s2, Result(a, s)) ↔ [ (room(s1) ∧ a = move(robot, s1, s2) ∧ inroom(robot, s1, s)) ∨ (inroom(robot, s2, s) ∧ (room(x) → a ≠ move(robot, s2, x))) ] }

31 Plan Representation and Modeling
How to model the right behavior?
[Figure: behavior-switching state machine for a soccer robot, with behaviors such as Standby, GetClose2Ball, TakeBall2Goal, Score, ClearBall, GoEmptySpot and GoHome, and switching conditions such as saw_ball, lost_ball, no_ball, unreachable_ball, undribbable, obstacle, success, unreachable_posture, can_shoot_safely and ShouldIGo.]

32 Plan Representation and Modeling
Def.: A Petri net (PN) graph or structure is a weighted bipartite graph (P, T, A, w), where:
P = {p1, p2, ..., pn} is the finite set of places
T = {t1, t2, ..., tm} is the finite set of transitions
A ⊆ (P × T) ∪ (T × P) is the set of arcs from places to transitions (pi, tj) and from transitions to places (tj, pi)
w: A → {1, 2, 3, ...} is the weight function on the arcs
Set of input places to tj ∈ T: I(tj) = {pi ∈ P : (pi, tj) ∈ A}
Set of output places from tj ∈ T: O(tj) = {pi ∈ P : (tj, pi) ∈ A}

33 Plan Representation and Modeling
Def.: A marked Petri net is a five-tuple (P, T, A, w, x), where (P, T, A, w) is a Petri net graph and x is a marking of the set of n places P; x = [x(p1), x(p2), ..., x(pn)] ∈ N^n is the row vector associated with x.
Def. (PN dynamics): The state transition function f: N^n × T → N^n of Petri net (P, T, A, w, x) is defined for transition tj ∈ T iff x(pi) ≥ w(pi, tj) for all pi ∈ I(tj) (tj is then said to be enabled). If f(x, tj) is defined, the new state is x' = f(x, tj), where
x'(pi) = x(pi) − w(pi, tj) + w(tj, pi), i = 1, ..., n.
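
A minimal sketch of the enabling and firing rules above, assuming each transition's input and output arc weights are stored as dictionaries mapping place index to weight; the three-place net at the bottom is a made-up example.

```python
import numpy as np

def enabled(x, w_in):
    """Transition is enabled iff x(p_i) >= w(p_i, t_j) for all input places."""
    return all(x[p] >= w for p, w in w_in.items())

def fire(x, w_in, w_out):
    """x'(p_i) = x(p_i) - w(p_i, t_j) + w(t_j, p_i)."""
    if not enabled(x, w_in):
        raise ValueError("transition not enabled in this marking")
    x_next = x.copy()
    for p, w in w_in.items():
        x_next[p] -= w            # consume tokens from input places
    for p, w in w_out.items():
        x_next[p] += w            # produce tokens in output places
    return x_next

# Hypothetical net: t1 consumes one token from p0 and puts one token in p1 and p2.
x0 = np.array([1, 0, 0])
t1_in, t1_out = {0: 1}, {1: 1, 2: 1}
print(fire(x0, t1_in, t1_out))    # [0 1 1]
```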

34 Plan Representation and Modeling
Def. (Labeled Petri net): A labeled Petri net N is an eight-tuple N = (P, T, A, w, E, l, x0, Xm), where:
(P, T, A, w) is a PN graph
E is the event set for transition labeling
l: T → E is the transition labeling function
x0 ∈ N^n is the initial state
Xm ⊆ N^n is the set of marked states
Def. (Languages generated and marked):
L(N) := { l(s) ∈ E* : s ∈ T* and f(x0, s) is defined }
Lm(N) := { l(s) ∈ L(N) : s ∈ T* and f(x0, s) ∈ Xm }

35 Plan Representation and Modeling
Petri Net Models of Robotic Tasks (Lima et al, 1998) (Milutinovic, Lima, 2002)
Places with tokens represent: resources available; primitive actions running. The state is distributed over the places with tokens (PN marking).
Events are assigned to transitions and represent: uncontrolled changes of state (e.g., caused by other agents or simply by the environment dynamics); controlled decisions to start a primitive action.
A transition fires when it is enabled and the labeling event occurs.

36 Plan Representation and Modeling PN model of a single robot in a competition (Lima et al, 1998)

37 Plan Representation and Modeling
(Lima et al, 1998) A Tool for Robotic Task Design and Distributed Execution. Further developments in (Milutinovic, Lima, 2002).
[Figure: Petri net with places p1 to p6 (standby, vision_ready2locate_ball, locating_ball, robot_ready2move, moving2ball, catching_ball) and transitions t1 to t5 labeled start, new_frame, ball_located, ready2catch, ball_catched.]

38 Plan Representation and Modeling
Petri Nets (PN) Language Model
Petri net N of slide 37: E = {s, nf, bl, r2c, bc}; l(t1) = s, l(t2) = nf, l(t3) = bl, ...; initial state x0 = [...], set of marked states Xm = {x0, [...]}.
x = x0 = [...]ᵀ (marking or state). Generated string: ε ∈ L(N).

39 Plan Representation and Modeling
Petri Nets (PN) Language Model
Petri net N of slide 37: E = {s, nf, bl, r2c, bc}; l(t1) = s, l(t2) = nf, l(t3) = bl, ...; initial state x0 = [...], set of marked states Xm = {x0, [...]}.
x = [...]ᵀ (marking or state). Generated string: s ∈ L(N).

40 Plan Representation and Modeling
Petri Nets (PN) Language Model
Petri net N of slide 37: E = {s, nf, bl, r2c, bc}; l(t1) = s, l(t2) = nf, l(t3) = bl, ...; initial state x0 = [...], set of marked states Xm = {x0, [...]}.
x = [...]ᵀ (marking or state). Generated string: s nf ∈ L(N).

41 Plan Representation and Modeling
Petri Nets (PN) Language Model
Petri net N of slide 37: E = {s, nf, bl, r2c, bc}; l(t1) = s, l(t2) = nf, l(t3) = bl, ...; initial state x0 = [...], set of marked states Xm = {x0, [...]}.
x = [...]ᵀ (marking or state). Generated string: s nf bl ∈ L(N).

42 Plan Representation and Modeling
Petri Nets (PN) Language Model
Petri net N of slide 37: E = {s, nf, bl, r2c, bc}; l(t1) = s, l(t2) = nf, l(t3) = bl, ...; initial state x0 = [...], set of marked states Xm = {x0, [...]}.
x = [...]ᵀ (marking or state).
Generated and Marked Languages:
L(N) = {ε, s, s nf, s nf bl, ...}
Lm(N) = {ε, ..., s nf bl r2c bc} ⊆ L(N)

43 Plan Representation and Modeling
Monitoring algorithms check the value of predicates over world state variables. An event occurrence means that a logical function of the predicates became true or false.
Examples of events:
found_ball: see(ball) = false → see(ball) = true
lost_ball: see(ball) = true → see(ball) = false
when see_ball AND closest_player2ball changes from false to true
PN markings represent world states. A plan to carry out a task is the sequence of primitive actions in a sequence of markings (world states). Plans are conditional, as resource places in markings represent logical pre-conditions for the execution of the next primitive action.
Example: primitive actions set X = {GetCloseToBall, TakeBallToGoal, Score}
Plan: GetCloseToBall. TakeBallToGoal. Score

44 Plan Representation and Modeling
Event sequences (i.e., strings) are an equivalent representation of plans. A language is the set of all possible plans for a robot.
Different language classes are equivalent to the machine types used to represent and execute the task (Finite State Machine, PN, ...). Of course, larger classes have an increased modeling power (e.g., PN languages vs regular/finite state machine languages).
Do not confuse this with modeling elegance: it is more natural to program with a rule-based system rather than with a state machine, but it is not necessarily more powerful (compare with C vs assembly).

45 Plan Representation and Modeling
Abstraction Levels in Discrete Event Systems
Untimed: event sequences e1, e2, ..., ek, ...; models: FSA (state sequence x0, x1, ..., xk, ...), PN (marking sequence x0, x1, ..., xk, ...)
Timed: time (duration) associated to events/transitions; models: Timed FSA x(t), Timed PN x(t)
Stochastic Timed: stochastic time associated to events/transitions; models: STA x(t), p(x(t)), SPN x(t), p(x(t))

46 Plan Qualitative Analysis
Qualitative view/models enable answering analysis questions such as: will bad behaviors occur? will unsafe states be avoided? will we attempt to use more resources than those available?
Qualitative view/models enable designing supervisors for specifications such as: eliminate substrings corresponding to bad behaviors; avoid blocking; ensure bounded usage of resources.

47 Plan Qualitative Analysis
Safety properties: for all executions the system avoids a bad set of events, or a set of bad strings is never generated or marked. The robot does not exhibit bad behaviours.
Blocking properties (deadlocks or livelocks): if the robot FSA blocks, the robot may get trapped in a no-return situation.

48 Plan Qualitative Analysis
Def. (Boundedness): Place pi ∈ P in PN N with initial state x0 is said to be k-bounded, or k-safe, if x(pi) ≤ k for all states x ∈ R(N), i.e., for all reachable states. This has to do with stability concerning the usage of resources available for a task (e.g., robots, tools, CPU, memory, ...).
Def. (Conservation): A PN N with initial state x0 is said to be conservative with respect to γ = [γ1, γ2, ..., γn] if
Σ_{i=1}^{n} γi x(pi) = constant
for all reachable states. This has to do with conservation of resources required for a task (e.g., robots, tools, CPU, memory, ...).
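
A minimal sketch of checking k-boundedness by enumerating the reachable set R(N) with the firing rule introduced earlier; the two-place net at the bottom is a made-up example, and the enumeration is capped because an unbounded net has infinitely many reachable markings.

```python
from collections import deque

def reachable_markings(x0, transitions, limit=10000):
    """Breadth-first enumeration of R(N); transitions is a list of (w_in, w_out)
    dictionaries mapping place index -> arc weight."""
    seen = {tuple(x0)}
    frontier = deque([tuple(x0)])
    while frontier and len(seen) < limit:
        x = frontier.popleft()
        for w_in, w_out in transitions:
            if all(x[p] >= w for p, w in w_in.items()):   # transition enabled
                nxt = list(x)
                for p, w in w_in.items():
                    nxt[p] -= w
                for p, w in w_out.items():
                    nxt[p] += w
                if tuple(nxt) not in seen:
                    seen.add(tuple(nxt))
                    frontier.append(tuple(nxt))
    return seen

def k_bound(markings, place):
    """Smallest k such that the place is k-bounded over the explored markings."""
    return max(x[place] for x in markings)

# Two places exchanging a single token: both are 1-bounded (1-safe).
transitions = [({0: 1}, {1: 1}), ({1: 1}, {0: 1})]
R = reachable_markings([1, 0], transitions)
print(R, k_bound(R, 0), k_bound(R, 1))   # {(1, 0), (0, 1)} 1 1
```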

49 Plan Qualitative Analysis
Def. (Liveness): A PN N with initial state x0 is said to be live if there always exists some sample path such that any transition can eventually fire from any state reached from x0.
Liveness levels - a transition in a PN may be:
Dead or L0-live, if the transition can never fire from this state
L1-live, if there is some firing sequence from x0 such that the transition can fire at least once
L2-live, if the transition can fire at least k times for some given positive integer k
L3-live, if there exists some infinite firing sequence in which the transition appears infinitely often
L4-live, if the transition is L1-live for every possible state reached from x0
This property is related to the reachability of given states and to the repeatability of system states (e.g., error recovery and returning to the initial state).

50 Plan Quantitative Analysis
Stochastic Models
STOCHASTIC TIMED AUTOMATA (STA): an STA with a Poisson clock structure is equivalent to a Markov Chain; transition probabilities can be computed from the STA transition probabilities and from the Poisson process rates for the events.
STOCHASTIC PETRI NET (SPN): an SPN with exponentially timed transitions is equivalent to a Markov Chain; transition probabilities can be computed from the random switch probabilities and from the exponential rates for the events.

51 Plan Quantitative Analysis
Stochastic view/models enable answering analysis questions such as: what is the probability of success of a task plan? given a probability of success for the plan, how many steps (actions) will it take to accomplish the task?
Stochastic view/models enable designing controllers for specifications such as: given some allowed number of steps for a plan, determine the plan that maximizes the probability of success; given some desired probability of success, determine the plan that minimizes the number of required actions, or the accumulated action cost.

52 Plan Quantitative Analysis
Markov Property:
Pr{ s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0 } = Pr{ s_{t+1} = s' | s_t, a_t }
Environments modeled by Stochastic Timed Automata satisfy the Markov Property.
MARKOV DECISION PROCESSES (MDP): the effects of robot actions are uncertain, but the environment states are fully observable (e.g., pickup and release actions moving an object between the states "on the table", "grasped" and "on the floor").
Solutions to MDPs come from: Dynamic Programming, Monte-Carlo, Temporal Differences (e.g., reinforcement learning).

53 Reinforcement Learning (RL)
At each step, the agent in state s_t ∈ S chooses an action a_t ∈ A(s_t); the environment returns a reinforcement r_{t+1} ∈ R and the next state s_{t+1}.
Goal: choose the action sequence that maximizes the return
R_t = Σ_{k=0}^{T} γ^k r_{t+k+1}, 0 ≤ γ ≤ 1
T may go to infinity, as long as γ < 1.
Rewards and state transitions after an action is executed are stochastic.
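
A one-line sketch of the return R_t above, evaluated over a finite, made-up reward sequence.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([0, 0, 1, 5], gamma=0.9))   # 0.9^2*1 + 0.9^3*5 = 4.455
```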

54 Markov Decision Processes
An RL task satisfying the Markov Property is known as a Markov Decision Process (MDP):
Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0 } = Pr{ s_{t+1} = s', r_{t+1} = r | s_t, a_t }
P^a_{ss'} = Pr{ s_{t+1} = s' | s_t = s, a_t = a }   (transition probabilities)
R^a_{ss'} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }   (expected reward)

55 Ex.: Recycling Robot
States: Battery High, Battery Low. Actions: search_trash, wait, recharge_battery. Each arc of the transition diagram is labeled with (transition probability, expected reward) for the action taken.
[Figure: in Battery High, search_trash keeps the battery high with probability α and discharges it with probability 1-α, both with reward R_search_trash; wait keeps the state with probability 1 and reward R_wait. In Battery Low, search_trash keeps the battery low with probability β and reward R_search_trash, and with probability 1-β the battery is depleted, the robot has to be rescued (reward -3) and ends up recharged; recharge_battery leads to Battery High with probability 1 and reward 0; wait keeps the state with probability 1 and reward R_wait.]
R_search_trash > R_wait > 0: number of cans collected while performing the corresponding tasks.

56 Value Functions
Policy: π: S × A → [0, 1], π(s, a) = probability of carrying out action a ∈ A(s) in state s.
State value for policy π (expected value of starting in state s and following policy π thereafter):
V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
(state, action) value for policy π (expected value of starting in state s, carrying out action a, and following policy π thereafter):
Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }
NOTE: the value of the final state, if any, is always zero.

57 Value Functions
Relation between the state value and the Q function for policy π:
Q^π(s, a) = E{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a }
Q* is such that its value is the maximum discounted cumulative reward that can be achieved starting from state s and applying action a as the first action:
V*(s) = max_{a'} Q*(s, a')
⇒ Q*(s, a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }

58 Bellman Equation for V^π and Q^π
V^π(s) = E_π{ R_t | s_t = s } = Σ_{a ∈ A(s)} π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
where P^a_{ss'} is the transition probability from s to s' under action a, and R^a_{ss'} is the average reward when moving from s to s' under action a.
This equation expresses a relation between the values of a state and its successors. For finite MDPs, it has a unique solution V^π.
Optimal state value function: V*(s) = max_π V^π(s), ∀s ∈ S
Optimal (state, action) value function: Q*(s, a) = max_π Q^π(s, a), ∀s ∈ S, a ∈ A(s)
Q*(s, a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }

59 Bellman Equation for V* and Q*
(state) V*(s) = max_{a ∈ A(s)} Q*(s, a) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]
(state, action) Q*(s, a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]
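
A minimal value-iteration sketch for the Bellman optimality equation above (the Dynamic Programming approach of the next slide), assuming the model is given as model[s][a] = list of (probability, next state, reward) triples; the tiny two-state MDP is loosely inspired by the recycling-robot example and its numbers are made up.

```python
def value_iteration(model, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a sum_s' P^a_ss' [R^a_ss' + gamma V(s')] to convergence."""
    V = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s in model:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
                        for a in model[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:              # stop when the value function has converged
            return V

# Illustrative two-state MDP (made-up probabilities and rewards).
model = {
    "high": {"search": [(0.7, "high", 3.0), (0.3, "low", 3.0)],
             "wait":   [(1.0, "high", 1.0)]},
    "low":  {"search":   [(0.6, "low", 3.0), (0.4, "high", -3.0)],  # may need rescue
             "recharge": [(1.0, "high", 0.0)],
             "wait":     [(1.0, "low", 1.0)]},
}
print(value_iteration(model))
```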

60 Possible Approaches to the Solution of the RL Problem
Dynamic Programming (DP): To determine V* for |S| = N, a system of N non-linear equations must be solved. Well-established mathematical method. A complete model of the environment is required (P^a_{ss'} and R^a_{ss'} known). Often faces the curse of dimensionality [Bellman, 1957].
Monte Carlo: Similar to DP, but with P^a_{ss'} and R^a_{ss'} unknown; they are determined from the average of several trial-and-error runs. Inappropriate for a step-by-step incremental approximation of V*.
Temporal Differences: Knowledge of P^a_{ss'} and R^a_{ss'} is not required. Step-by-step incremental approximation of V*. Mathematical analysis is more complex. Example: Q-learning.

61 Q-Learning
Once V* is known or learned, an apparently obvious solution for the RL problem would be:
π*(s) = argmax_{a ∈ A(s)} E{ r(s, a) + γ V*(δ(s, a)) }, with δ(s, a) the state transition function
... but r(s, a) and δ(s, a) are unknown in the general case.
However, if we know or learn Q*, a different solution arises:
Q*(s, a) = E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
In a stochastic environment, with unknown P^a_{ss'} and R^a_{ss'}, the agent's own experience when interacting with its environment can be used to learn Q* and π*.

62 Q-Learning - Algorithm
Initialize Q(s, a) randomly or arbitrarily
Repeat forever (for each episode or trial):
  Initialize s
  Repeat (for each step n of the episode):
    Choose action a for s
    Execute action a and observe r and s'
    Q_{n+1}(s, a) ← Q_n(s, a) + α_n [ r(s, a) + γ max_{a'} Q_n(s', a') − Q_n(s, a) ]
    s ← s'
  until s is final
A constant α allows adaptability to slow environment changes, but it does not guarantee convergence; convergence is only possible with a temporal decay of α, under given circumstances.
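
A minimal tabular Q-learning sketch following the algorithm above; the environment interface (env.reset() returning a state, env.step(a) returning the next state, the reward and a done flag) is an assumed convention in the style of common RL toolkits, not something defined in the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                       # Q[(s, a)], arbitrary (zero) initialization
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice of a in s (exploration vs exploitation, slide 67)
            if random.random() < eps:
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)        # execute a, observe r and s'
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q <- Q + alpha (target - Q)
            s = s_next
    return Q
```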

63 Q-Learning - an Example
[Figure: grid-world with goal state G, showing the immediate rewards r(s, a), the optimal values V*(s), and the learned values Q_n(s, a), for α = 1 and a given γ.]

64 Q-Learning - an Example
[Figure: the same grid-world example, showing r(s, a), V*(s), and Q_n(s, a) after additional learning steps, with α = 1 and a given γ.]

65 Q-Learning - Another Example
[Figure: another Q-learning example, showing the initial situation and the situation after some learning steps.]

66 Q-Learning Algorithm Convergence
If each pair (s, a) is visited an infinite number of times, with 0 ≤ α_n < 1, Σ_{i=1}^{∞} α_{n_i(s,a)} = ∞ and Σ_{i=1}^{∞} α²_{n_i(s,a)} < ∞,
then Pr[ lim_{n→∞} Q̂_n(s, a) = Q*(s, a) ] = 1, ∀ s, a.

67 Action Selection: Exploration vs Exploitation
Exploration: less promising actions, which may nevertheless lead to good results, are tested.
Exploitation: takes advantage of tested actions which are more promising, i.e., which have a larger Q(s, a).
ε-greedy: at each step n, picks the best action so far with probability 1-ε, for small ε, but can also pick, with probability ε, one of the other actions in a uniformly distributed random fashion.
softmax: at each step n, picks the action to be executed according to a Gibbs or Boltzmann distribution:
π_n(s, a) = e^{Q_n(s, a)/τ} / Σ_{b ∈ A(s)} e^{Q_n(s, b)/τ}
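
A minimal sketch of the two selection rules above; Q is a dictionary of (state, action) values, and the parameter names eps and tau stand for the ε and τ of the slide.

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(s, a)])           # exploit

def softmax(Q, s, actions, tau=1.0):
    weights = [math.exp(Q[(s, a)] / tau) for a in actions] # Gibbs/Boltzmann distribution
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for a, w in zip(actions, weights):
        acc += w
        if r <= acc:
            return a
    return actions[-1]
```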


More information

Artificial Intelligence. Propositional logic

Artificial Intelligence. Propositional logic Artificial Intelligence Propositional logic Propositional Logic: Syntax Syntax of propositional logic defines allowable sentences Atomic sentences consists of a single proposition symbol Each symbol stands

More information

COMP219: Artificial Intelligence. Lecture 19: Logic for KR

COMP219: Artificial Intelligence. Lecture 19: Logic for KR COMP219: Artificial Intelligence Lecture 19: Logic for KR 1 Overview Last time Expert Systems and Ontologies Today Logic as a knowledge representation scheme Propositional Logic Syntax Semantics Proof

More information

CS 331: Artificial Intelligence Propositional Logic I. Knowledge-based Agents

CS 331: Artificial Intelligence Propositional Logic I. Knowledge-based Agents CS 331: Artificial Intelligence Propositional Logic I 1 Knowledge-based Agents Can represent knowledge And reason with this knowledge How is this different from the knowledge used by problem-specific agents?

More information

Knowledge-based Agents. CS 331: Artificial Intelligence Propositional Logic I. Knowledge-based Agents. Outline. Knowledge-based Agents

Knowledge-based Agents. CS 331: Artificial Intelligence Propositional Logic I. Knowledge-based Agents. Outline. Knowledge-based Agents Knowledge-based Agents CS 331: Artificial Intelligence Propositional Logic I Can represent knowledge And reason with this knowledge How is this different from the knowledge used by problem-specific agents?

More information

Reading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

Reading Response: Due Wednesday. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Reading Response: Due Wednesday R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 Another Example Get to the top of the hill as quickly as possible. reward = 1 for each step where

More information

COMP3702/7702 Artificial Intelligence Week 5: Search in Continuous Space with an Application in Motion Planning " Hanna Kurniawati"

COMP3702/7702 Artificial Intelligence Week 5: Search in Continuous Space with an Application in Motion Planning  Hanna Kurniawati COMP3702/7702 Artificial Intelligence Week 5: Search in Continuous Space with an Application in Motion Planning " Hanna Kurniawati" Last week" Main components of PRM" Collision check for a configuration"

More information

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning

More information

On and Off-Policy Relational Reinforcement Learning

On and Off-Policy Relational Reinforcement Learning On and Off-Policy Relational Reinforcement Learning Christophe Rodrigues, Pierre Gérard, and Céline Rouveirol LIPN, UMR CNRS 73, Institut Galilée - Université Paris-Nord first.last@lipn.univ-paris13.fr

More information

Linear-time Temporal Logic

Linear-time Temporal Logic Linear-time Temporal Logic Pedro Cabalar Department of Computer Science University of Corunna, SPAIN cabalar@udc.es 2015/2016 P. Cabalar ( Department Linear oftemporal Computer Logic Science University

More information

Reinforcement Learning Active Learning

Reinforcement Learning Active Learning Reinforcement Learning Active Learning Alan Fern * Based in part on slides by Daniel Weld 1 Active Reinforcement Learning So far, we ve assumed agent has a policy We just learned how good it is Now, suppose

More information

Introduction to Temporal Logic. The purpose of temporal logics is to specify properties of dynamic systems. These can be either

Introduction to Temporal Logic. The purpose of temporal logics is to specify properties of dynamic systems. These can be either Introduction to Temporal Logic The purpose of temporal logics is to specify properties of dynamic systems. These can be either Desired properites. Often liveness properties like In every infinite run action

More information

CS230: Lecture 9 Deep Reinforcement Learning

CS230: Lecture 9 Deep Reinforcement Learning CS230: Lecture 9 Deep Reinforcement Learning Kian Katanforoosh Menti code: 21 90 15 Today s outline I. Motivation II. Recycling is good: an introduction to RL III. Deep Q-Learning IV. Application of Deep

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information