CAP6938-02 Plan, Activity, and Intent Recognition


CAP6938-02 Plan, Activity, and Intent Recognition Lecture 10: Sequential Decision-Making Under Uncertainty (part 1) MDPs and POMDPs Instructor: Dr. Gita Sukthankar Email: gitars@eecs.ucf.edu SP2-1

Reminder Turn-in questionnaire Homework (due Thurs): 1 page describing the improvements that you plan to make to your project in the next half of the semester SP2-2

Model: POMDP Applications? Strengths? Weaknesses? How does a POMDP differ from an HMM? SP2-3

Model: POMDP Applications? Human-robot interaction Dialog management Assistive technology Agent interaction Strengths? Integrates action selection directly with state estimation Weaknesses? Intractable for real-world domains How does a POMDP differ from an HMM? MDPs and POMDPs are for calculating optimal decisions from sequences of observations; HMMs are for recognizing hidden state from sequences of observations. MDP and POMDP: actions and rewards SP2-4

Markov Decision Processes Classical planning models: logical representation of transition systems goal-based objectives plans as sequences Markov decision processes generalize this view controllable, stochastic transition system general objective functions (rewards) that allow tradeoffs with transition probabilities to be made more general solution concepts (policies) SP2-5

Markov Decision Processes An MDP has four components, S, A, R, Pr: (finite) state set S (|S| = n) (finite) action set A (|A| = m) transition function Pr(s,a,t) each Pr(s,a,-) is a distribution over S represented by a set of n x n stochastic matrices bounded, real-valued reward function R(s) represented by an n-vector can be generalized to include action costs: R(s,a) can be stochastic (but replaceable by expectation) Model easily generalizable to countable or continuous state and action spaces SP2-6
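For concreteness, here is a minimal sketch of how the (S, A, R, Pr) components above might be stored; the array names and the toy numbers are illustrative, not from the slides.

```python
# Minimal sketch of the (S, A, R, Pr) components described above.
# Names and numbers are made up for illustration.
import numpy as np

n_states, n_actions = 3, 2

# One n x n stochastic matrix per action: P[a][s, s'] = Pr(s, a, s')
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0],
        [0.0, 0.8, 0.2],
        [0.1, 0.0, 0.9]]
P[1] = [[0.5, 0.5, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.3, 0.7]]

# Bounded, real-valued reward function R(s), represented as an n-vector.
R = np.array([0.0, 1.0, -1.0])

# Sanity check: each Pr(s, a, -) must be a distribution over S.
assert np.allclose(P.sum(axis=2), 1.0)
```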

System Dynamics Finite State Space S State s_1013: Loc = 236, Joe needs printout, Craig needs coffee, ... SP2-7

System Dynamics Finite Action Space A Pick up Printouts? Go to Coffee Room? Go to charger? SP2-8

System Dynamics Transition Probabilities: Pr(s_i, a, s_j) Prob. = 0.95 SP2-9

System Dynamics Transition Probabilities: Pr(s_i, a, s_k) Prob. = 0.05
Transition matrix for one action (rows = current state, columns = next state):
       s1     s2    ...   sn
s1     0.9    0.05  ...   0.0
s2     0.0    0.20  ...   0.1
...
sn     0.1    0.0   ...   0.0
SP2-10

Reward Process Reward Function: R(s_i) (action costs possible) Example: Reward = -10
Reward vector R (one entry per state):
s1     12
s2     0.5
...
sn     10
SP2-11

Assumptions Markovian dynamics (history independence): Pr(S_{t+1} | A_t, S_t, A_{t-1}, S_{t-1}, ..., S_0) = Pr(S_{t+1} | A_t, S_t) Markovian reward process: Pr(R_t | A_t, S_t, A_{t-1}, S_{t-1}, ..., S_0) = Pr(R_t | A_t, S_t) Stationary dynamics and reward: Pr(S_{t+1} | A_t, S_t) = Pr(S_{t'+1} | A_{t'}, S_{t'}) for all t, t' Full observability: though we can't predict what state we will reach when we execute an action, once it is realized, we know what it is SP2-12

Policies Nonstationary policy π: S x T → A π(s,t) is the action to do at state s with t stages-to-go Stationary policy π: S → A π(s) is the action to do at state s (regardless of time) analogous to a reactive or universal plan These assume the following properties: full observability, history-independence, deterministic action choices MDPs and POMDPs are methods for calculating the optimal lookup tables (policies). SP2-13
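As a small illustration of the slide's point that policies are lookup tables (the state and action names below are made up):

```python
# Policies as lookup tables, following the slide's definitions.
states  = ["s1", "s2"]
actions = ["go_coffee", "pick_up_printouts"]

# Stationary policy pi: S -> A  (one action per state, regardless of time)
stationary_pi = {"s1": "go_coffee", "s2": "pick_up_printouts"}

# Nonstationary policy pi: S x T -> A  (action depends on stages-to-go t)
nonstationary_pi = {
    ("s1", 2): "go_coffee",
    ("s1", 1): "pick_up_printouts",
    ("s2", 2): "pick_up_printouts",
    ("s2", 1): "go_coffee",
}

print(stationary_pi["s1"])          # action at s1
print(nonstationary_pi[("s1", 1)])  # action at s1 with 1 stage to go
```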

Value of a Policy How good is a policy π? How do we measure accumulated reward? Value function V: S → R associates a value with each state (sometimes S x T) V^π(s) denotes the value of the policy at state s how good is it to be at state s? depends on the immediate reward, but also on what you achieve subsequently expected accumulated reward over the horizon of interest note that V^π(s) is not simply R(s); it measures utility SP2-14

Value of a Policy (cont'd) Common formulations of value: Finite horizon n: total expected reward given π Infinite horizon discounted: discounting keeps the total bounded SP2-15
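For reference, the usual textbook forms of these two criteria (not reproduced from the slide; γ denotes the discount factor, 0 ≤ γ < 1):

```latex
% Finite horizon n: total expected reward under policy \pi from state s
V^{\pi}(s) = E\left[ \sum_{t=0}^{n} R^{t} \,\middle|\, \pi, s \right]
% Infinite horizon, discounted: the \gamma^{t} factor keeps the total bounded
V^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} R^{t} \,\middle|\, \pi, s \right]
```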

Value Iteration (Bellman 1957) Markov property allows exploitation of the DP principle for optimal policy construction: no need to enumerate |A|^(Tn) possible policies Value Iteration:
V^0(s) = R(s), for all s
V^k(s) = R(s) + max_a Σ_{s'} Pr(s,a,s') V^{k-1}(s')      (Bellman backup)
π*(s,k) = argmax_a Σ_{s'} Pr(s,a,s') V^{k-1}(s')
V^k is the optimal k-stage-to-go value function SP2-16
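A short value-iteration sketch over transition matrices P and rewards R like those in the earlier sketch; this is illustrative code under those assumptions, not the course's implementation.

```python
# Finite-horizon value iteration with rewards on states (hypothetical sketch).
import numpy as np

def value_iteration(P, R, T):
    """P: (|A|, n, n) transition matrices, R: (n,) rewards, T: horizon."""
    n = R.shape[0]
    V = R.copy()                      # V^0(s) = R(s)
    policy = np.zeros(n, dtype=int)
    for _ in range(T):
        # Q[a, s] = R(s) + sum_{s'} Pr(s, a, s') * V(s')   (Bellman backup)
        Q = R + P @ V                 # shape (|A|, n)
        policy = Q.argmax(axis=0)     # greedy action per state
        V = Q.max(axis=0)             # V^k(s)
    return V, policy

# Example call: V, pi = value_iteration(P, R, T=10)
```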

Value Iteration [Figure: states s1-s4 across stages V^{t-2}, V^{t-1}, V^t, V^{t+1}; from s4 one action leads to s1/s4 with probabilities 0.7/0.3, the other to s2/s3 with probabilities 0.4/0.6]
V^t(s4) = R(s4) + max { 0.7 V^{t+1}(s1) + 0.3 V^{t+1}(s4),  0.4 V^{t+1}(s2) + 0.6 V^{t+1}(s3) }
SP2-17

Value Iteration [Figure: same diagram as the previous slide]
π^t(s4) = the action attaining the max in the backup above
SP2-18

Value Iteration SP2-19

Complexity T iterations At each iteration, |A| computations of an n x n matrix times an n-vector: O(|A| n^3) Total: O(T |A| n^3) Can exploit sparsity of the matrix: O(T |A| n^2) SP2-20

MDP Application: Electric Elves Calculating an optimal transfer-of-control policy in an adjustable autonomy application Dynamically adjusts users' meetings State of the world is known; future actions of users are unknown SP2-21

Recognizing User Intent MDP POMDP SP2-22

POMDPs Partially observable Markov Decision Process (POMDP): a stochastic system Σ = (S, A, P) as before A finite set O of observations P_a(o|s) = probability of observation o in state s after executing action a Require that for each a and s, Σ_{o in O} P_a(o|s) = 1 O models partial observability The controller can't observe s directly; it can only observe o The same observation o can occur in more than one state Why do the observations depend on the action a? Why do we have P_a(o|s) rather than P(o|s)? This is a way to model sensing actions, which do not change the state but return information (make some observation available, e.g., from a sensor) SP2-23
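A hedged sketch of the observation model described above, stored as an array with the normalization requirement checked (sizes and numbers are made up):

```python
# Illustrative observation model for a tiny POMDP.
# Z[a, s, o] = P_a(o | s): probability of observing o in state s after action a.
import numpy as np

n_actions, n_states, n_obs = 2, 3, 2
Z = np.full((n_actions, n_states, n_obs), 0.5)   # uninformative by default
Z[1, 2] = [0.9, 0.1]                             # a sensing action that is informative in state 2

# Requirement from the slide: for each a and s, sum_o P_a(o | s) = 1
assert np.allclose(Z.sum(axis=2), 1.0)
```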

Example of a Sensing Action Suppose there are a state s1, an action a1, and an observation o1 with the following properties: For every state s, P_{a1}(s|s) = 1 (a1 does not change the state) P_{a1}(o1|s1) = 1, and P_{a1}(o1|s) = 0 for every state s ≠ s1 After performing a1, o1 occurs if and only if we're in state s1 Then to tell if you're in state s1, just perform action a1 and see whether you observe o1 Two states s and s' are indistinguishable if for every o and a, P_a(o|s) = P_a(o|s') SP2-24

Belief States At each point we will have a probability distribution b(s) over the states in S b(s) is called a belief state (our belief about what state we're in) Basic properties: 0 ≤ b(s) ≤ 1 for every s in S, and Σ_{s in S} b(s) = 1 Definitions: b_a = the belief state after doing action a in belief state b Thus b_a(s) = P(in s after doing a in b) = Σ_{s' in S} P_a(s|s') b(s') b_a(o) = P(observe o after doing a in b) (marginalize over states) = Σ_{s in S} P_a(o|s) b_a(s) b_a^o(s) = P(in s after doing a in b and observing o) Belief states are n-dimensional vectors representing the probability of being in every state. SP2-25

Belief States (Continued) Recall that in general, P(x|y,z) P(y|z) = P(x,y|z) Thus P_a(o|s) b_a(s) = P(observe o after doing a in s) * P(in s after doing a in b) = P(in s and observe o after doing a in b) Similarly, b_a^o(s) b_a(o) = P(in s after doing a in b and observing o) * P(observe o after doing a in b) = P(in s and observe o after doing a in b) Thus b_a^o(s) = P_a(o|s) b_a(s) / b_a(o) Formula for updating the belief state Can use this to distinguish states that would otherwise be indistinguishable SP2-26
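A small sketch of this update rule in code (illustrative names; the two-state numbers are made up and are not the robot example that follows):

```python
# Belief update b_a^o(s) = P_a(o|s) b_a(s) / b_a(o), as derived above.
import numpy as np

def belief_update(b, T_a, Z_a, o):
    """b: (n,) prior belief; T_a[s', s] = P_a(s | s'); Z_a[s, o] = P_a(o | s)."""
    b_a = b @ T_a                    # b_a(s) = sum_{s'} P_a(s | s') b(s')
    unnorm = Z_a[:, o] * b_a         # P_a(o | s) b_a(s)
    return unnorm / unnorm.sum()     # divide by b_a(o) = sum_s P_a(o | s) b_a(s)

# Tiny two-state example: a sensing action leaves the state unchanged,
# and its observation is informative only in state 1.
b   = np.array([0.5, 0.5])
T_a = np.array([[1.0, 0.0], [0.0, 1.0]])
Z_a = np.array([[0.5, 0.5], [1.0, 0.0]])
print(belief_update(b, T_a, Z_a, o=0))   # -> [1/3, 2/3]
```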

Example Robot r1 can move between l1 and l2: move(r1,l1,l2), move(r1,l2,l1) There may be a container c1 in location l2: in(c1,l2) Observations O = {full, empty} full: c1 is present empty: c1 is absent abbreviate full as f, and empty as e [Figure: state-transition diagram for a = move(r1,l1,l2)] SP2-27

Example (Continued) Neither move action returns useful observations: for every state s and for a = either move action, P_a(f|s) = P_a(e|s) = 0.5 Thus if there are no other actions, then s1 and s2 are indistinguishable, and s3 and s4 are indistinguishable [Figure: same state-transition diagram, a = move(r1,l1,l2)] SP2-28

Example (Continued) Suppose there's a sensing action see that works perfectly in location l2: P_see(f|s4) = P_see(e|s3) = 1, P_see(f|s3) = P_see(e|s4) = 0 see does not work elsewhere: P_see(f|s1) = P_see(e|s1) = P_see(f|s2) = P_see(e|s2) = 0.5 Then s1 and s2 are still indistinguishable, but s3 and s4 are now distinguishable [Figure: state-transition diagram for a = move(r1,l1,l2)] SP2-29

Example (Continued) By itself, see doesn't tell us the state with certainty: b_see^e(s3) = P_see(e|s3) * b_see(s3) / b_see(e) = 1 * 0.25 / 0.5 = 0.5 If we first do a = move(r1,l1,l2) and then do see, this will tell us the state with certainty: let b' = b_a, then b'_see^e(s3) = P_see(e|s3) * b'_see(s3) / b'_see(e) = 1 * 0.5 / 0.5 = 1 [Figure: belief states b and b' = b_a for a = move(r1,l1,l2)] SP2-30

Modified Example Suppose we know the initial belief state is b Policy to tell if there's a container in l2: π = {(b, move(r1,l1,l2)), (b', see)}, where b' = b_a for a = move(r1,l1,l2) [Figure: the corresponding belief states] SP2-31

Solving POMDPs Information-state MDPs Belief states of POMDP are states in new MDP Continuous state space Discretise Policy-tree algorithms SP2-32

Policy Trees Policy tree: an agent's non-stationary t-step policy Tree(a,T): creates a new policy tree with action a at the root and, for each observation z, the subtree T(z) V_p: vector for the value function of policy tree p, with one component per state Act(p): action at the root of tree p Subtree(p,z): subtree of p after observation z Stval(a,z,p): vector of the probability-weighted value of tree p after a, z SP2-33
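A rough illustration (hypothetical code, not from the slides) of how such a policy tree and the Act/Subtree accessors could be represented:

```python
# Hypothetical sketch of a policy-tree node and the accessors named above.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTree:
    action: str                                                       # action taken at the root
    subtrees: Dict[str, "PolicyTree"] = field(default_factory=dict)   # one subtree per observation

def act(p: PolicyTree) -> str:                       # Act(p): action at the root of tree p
    return p.action

def subtree(p: PolicyTree, z: str) -> "PolicyTree":  # Subtree(p, z): subtree of p after observation z
    return p.subtrees[z]

# A 2-step tree: sense first, then act depending on the observation.
p = PolicyTree("see", {"full": PolicyTree("pick_up"), "empty": PolicyTree("move")})
print(act(p), act(subtree(p, "full")))
```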

Application: Nursebot Robot assists elderly patients Model uncertainty about the user's dialog and position Exploit hierarchical structure to handle a large state space SP2-34

Value Functions [Figures: 2-state and 3-state cases] SP2-35

References Most slides were taken from Eyal Amir's course, CS 598, Decision Making under Uncertainty (lectures 12 and 13) L. Kaelbling, M. Littman, and A. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains," Artificial Intelligence, Volume 101, pp. 99-134, 1998 SP2-36