Some AI Planning Problems

Course Logistics CS533: Intelligent Agents and Decision Making. M, W, F: 1:00-1:50. Instructor: Alan Fern (KEC2071). Office hours: by appointment (see me after class or send email). Emailing me: include "CS533 Student" in the subject and resend if you don't hear back within a day. The course website (link on the instructor's home page) has lecture notes, assignments, and HW solutions. Written homework: assigned and collected regularly (submission via email is accepted); not graded for correctness, but rather for completion with good effort (no guarantee on when it will be returned; make a copy if you want one). You must complete 90% of the homework with good effort or a letter grade will be deducted from the final grade. Grade based on: 25% midterm exam (in class), 25% final exam (take home), 25% final project (during the last month), 25% three mini-projects (require some implementation).

Some AI Planning Problems Fire & Rescue Response Planning, Solitaire, Real-Time Strategy Games, Helicopter Control, Legged Robot Control, Network Security/Control.

Some AI Planning Problems Health Care: personalized treatment planning, hospital logistics/scheduling. Transportation: autonomous vehicles, supply chain logistics, air traffic control. Assistive Technologies: automated assistants for elderly/disabled, household robots. Sustainability: smart grid, forest fire management, ...

Common Elements We have a controllable system that can change state over time (in some predictable way). The state describes the essential information about the system (e.g., the visible card information in Solitaire). We have an objective that specifies which states, or state sequences, are more/less preferred. We can (partially) control the system state transitions by taking certain actions. Problem: at each moment we must select an action to optimize the overall objective, i.e., produce the most preferred state sequences.

Some Dimensions of AI Planning [diagram: agent observing and acting on the world in pursuit of a goal] Observations: fully observable vs. partially observable. Actions: sole source of change vs. other sources; deterministic vs. stochastic; instantaneous vs. durative.

Classical Planning Assumptions (primary focus of AI planning until the early 90's) Observations: fully observable. Actions: sole source of change, deterministic, instantaneous. Goal: achieve a goal condition.

Classical Planning Assumptions (primary focus of AI planning until the early 90's) Observations: fully observable. Actions: sole source of change, deterministic, instantaneous. Goal: achieve a goal condition. Greatly limits applicability.

Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model Observations: fully observable. Actions: sole source of change, stochastic, instantaneous. Goal: maximize expected reward over lifetime. We will primarily focus on MDPs.

Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model [diagram: agent-world loop] The agent observes the world state, chooses an action from a finite set, and the world makes a probabilistic state transition (which depends on the action). Goal: maximize expected reward over lifetime.

Example MDP State: describes all visible info about the cards. Actions: the different legal card movements. Goal: win the game or play the max # of cards.

Course Outline The course is structured around algorithms for solving MDPs, with different assumptions about knowledge of the MDP model and about how the MDP is represented. 1. Markov Decision Processes (MDPs) Basics: basic definitions and solution techniques; assume an exact MDP model is known; exact solutions for small/moderate-size problems. 2. Monte-Carlo Planning: assumes an MDP simulator is available; approximate solutions for large problems. 3. Reinforcement Learning: the MDP model is not known to the agent; exact solutions for small/moderate problems; approximate solutions for large problems. 4. Planning with Factored Representations of Huge MDPs: symbolic dynamic programming; classical planning for deterministic problems.

Markov Decision Processes Alan Fern* *Based in part on slides by Craig Boutilier and Daniel Weld

Markov Decision Processes An MDP has four components, S, A, R, T: a finite state set S (|S| = n); a finite action set A (|A| = m); a transition function T(s,a,s') = Pr(s' | s,a), the probability of going to state s' after taking action a in state s (how many parameters does it take to represent? m*n*(n-1)); and a bounded, real-valued reward function R(s), the immediate reward we get for being in state s. Roughly speaking, the objective is to select actions in order to maximize total reward. For example, in a goal-based domain R(s) may equal 1 for goal states and 0 for all others (or -1 reward for non-goal states).
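
To make these components concrete, here is a minimal sketch (assuming Python with numpy; the 3-state, 2-action numbers are made up for illustration, not taken from the slides) of how T and R can be stored as arrays.

```python
import numpy as np

# A toy MDP with n = 3 states and m = 2 actions (numbers are made up for illustration).
n, m = 3, 2

# T[a, s, s'] = Pr(s' | s, a); each row T[a, s, :] must sum to 1.
T = np.array([
    [[0.7, 0.3, 0.0],   # action 0 from states 0, 1, 2
     [0.0, 0.4, 0.6],
     [0.0, 0.0, 1.0]],
    [[0.1, 0.9, 0.0],   # action 1 from states 0, 1, 2
     [0.5, 0.5, 0.0],
     [0.0, 0.2, 0.8]],
])

# R[s] = immediate reward for being in state s (goal-based: 1 at the goal state, 0 elsewhere).
R = np.array([0.0, 0.0, 1.0])

assert T.shape == (m, n, n)
assert np.allclose(T.sum(axis=2), 1.0)  # each (a, s) row is a probability distribution
```

Because each row T[a, s, :] must sum to 1, only m*n*(n-1) of its entries are free, which matches the parameter count quoted on the slide.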

Graphical View of MDP [diagram: dynamic Bayesian network over actions A_t, A_{t+1}, states S_t, S_{t+1}, S_{t+2}, and rewards R_t, R_{t+1}, R_{t+2}]

Assumptions First-order Markovian dynamics (history independence): Pr(S_{t+1} | A_t, S_t, A_{t-1}, S_{t-1}, ..., S_0) = Pr(S_{t+1} | A_t, S_t); the next state only depends on the current state and current action. State-dependent reward: R_t = R(S_t); the reward is a deterministic function of the current state. Stationary dynamics: Pr(S_{t+1} | A_t, S_t) = Pr(S_{k+1} | A_k, S_k) for all t, k; the world dynamics and reward function do not depend on absolute time. Full observability: though we can't predict exactly which state we will reach when we execute an action, after the action is executed we see what the state is.

What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T). Output: ???? Should the solution to an MDP be just a sequence of actions such as (a1, a2, a3, ...)? Consider a single-player card game like Blackjack/Solitaire. No! In general an action sequence is not sufficient: actions have stochastic effects, so the state we end up in is uncertain. This means that we might end up in states where the remainder of the action sequence doesn't apply or is a bad choice. A solution should tell us what the best action is for any possible situation/state that might arise.

Policies ("plans" for MDPs) A solution to an MDP is a policy. Two types of policies: nonstationary and stationary. Nonstationary policies are used when we are given a finite planning horizon H, i.e. we are told how many actions we will be allowed to take. Nonstationary policies are functions from states and times to actions, π : S x T -> A, where T is the non-negative integers. π(s,t) tells us what action to take at state s when there are t stages-to-go (note that we are using the convention that t represents stages/decisions to go, rather than the time step).

Policies ("plans" for MDPs) What if we want to continue taking actions indefinitely? Use stationary policies. A stationary policy is a mapping from states to actions, π : S -> A. π(s) is the action to do at state s (regardless of time); it specifies a continuously reactive controller. Note that both nonstationary and stationary policies assume or have these properties: full observability, history-independence, deterministic action choice.

What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T). Output: a policy such that ???? We don't want to output just any policy; we want to output a good policy, one that accumulates lots of reward.

Value of a Policy How good is a policy π? How do we measure the reward accumulated by π? A value function V : S -> R associates a value with each state (or each state and time for non-stationary π). V_π(s) denotes the value of the policy at state s; it depends on the immediate reward, but also on what you achieve subsequently by following π. An optimal policy is one that is no worse than any other policy at any state. The goal of MDP planning is to compute an optimal policy.

What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T). Output: a policy that achieves an optimal value. This depends on how we define the value of a policy; there are several choices, and the solution algorithms depend on the choice. We will consider two common choices: Finite-Horizon Value and Infinite-Horizon Discounted Value.

Finite-Horizon Value Functions We first consider maximizing expected total reward over a finite horizon; this assumes the agent has H time steps to live. To act optimally, should the agent use a stationary or non-stationary policy? I.e., should the action it takes depend on absolute time? Put another way: if you had only one week to live, would you act the same way as if you had fifty years to live?

Finite Horizon Problems Value (utility) depends on stages-to-go, hence we use a nonstationary policy. V^k_π(s) is the k-stage-to-go value function for π: the expected total reward for executing π starting in s for k time steps, V^k_π(s) = E[ Σ_{t=0}^{k} R^t | π, s ] = E[ Σ_{t=0}^{k} R(s^t) | a^t = π(s^t, k-t), s^0 = s ]. Here R^t and s^t are random variables denoting the reward received and state at time-step t when starting in s; these are random variables since the world is stochastic.

Computational Problems There are two problems that we will be interested in solving. Policy evaluation: given an MDP and a nonstationary policy π, compute the finite-horizon value function V^k_π(s) for any k. Policy optimization: given an MDP and a horizon H, compute the optimal finite-horizon policy; we will see this is equivalent to computing the optimal value function. How many finite-horizon policies are there? |A|^{Hn}. So we can't just enumerate policies for efficient optimization.

Finite-Horizon Policy Evaluation We can use dynamic programming to compute V^k_π(s); the Markov property is critical for this. (a) V^0_π(s) = R(s), for all s. (b) V^k_π(s) = R(s) + Σ_{s'} T(s, π(s,k), s') V^{k-1}_π(s'), i.e. the immediate reward plus the expected future payoff with k-1 stages to go. [diagram: from state s, action π(s,k) reaches s1 with probability 0.7 and s2 with probability 0.3, backing V^{k-1} values up into V^k] What is the total time complexity? O(Hn^2)

Finite-Horizon Policy Evaluation We can use dynamic programming to compute V^k_π(s); the Markov property is critical for this. (a) V^0_π(s) = R(s), for all s. (b) V^k_π(s) = R(s) + Σ_{s'} T(s, π(s,k), s') V^{k-1}_π(s'), i.e. the immediate reward plus the expected future payoff with k-1 stages to go. [diagram: from state s, action π(s,k) reaches s1 with probability 0.7 and s2 with probability 0.3, backing V^{k-1} values up into V^k] What is the total time complexity? O(kn^2)
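
A minimal sketch of this dynamic program, assuming Python/numpy and the array layout from the earlier MDP sketch (T[a, s, s'] and R[s]); the nonstationary policy is assumed to be given as an array pi[k, s] of actions indexed by stages-to-go. This is illustrative, not the course's reference implementation.

```python
import numpy as np

def evaluate_finite_horizon(T, R, pi, H):
    """Finite-horizon policy evaluation by dynamic programming.

    T: (m, n, n) array, T[a, s, s'] = Pr(s' | s, a)
    R: (n,) array of state rewards
    pi: (H + 1, n) integer array, pi[k, s] = action at s with k stages-to-go
    Returns V of shape (H + 1, n) with V[k, s] = k-stage-to-go value of pi at s.
    """
    n = R.shape[0]
    V = np.zeros((H + 1, n))
    V[0] = R                                   # (a) V^0(s) = R(s)
    for k in range(1, H + 1):                  # (b) back up one stage at a time
        for s in range(n):
            a = pi[k, s]
            V[k, s] = R[s] + T[a, s] @ V[k - 1]
    return V
```

The nested loops make the cost visible: H stages times n states times an O(n) expectation per state.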

Policy Optimization: Bellman Backups How can we compute the optimal V^{t+1}(s) given the optimal V^t? Compute the expectation over successor states for each action, then take the max. [diagram: from state s, action a1 reaches s1 with probability 0.7 and s4 with probability 0.3; action a2 reaches s2 with probability 0.4 and s3 with probability 0.6] V^{t+1}(s) = R(s) + max{ 0.7 V^t(s1) + 0.3 V^t(s4), 0.4 V^t(s2) + 0.6 V^t(s3) }
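
As a concrete check of this backup, a tiny sketch with made-up values for V^t and R(s); only the 0.7/0.3 and 0.4/0.6 transition probabilities come from the slide.

```python
# Hypothetical current values for the four successor states s1..s4 (illustrative numbers).
V_t = {"s1": 2.0, "s2": 5.0, "s3": 1.0, "s4": 0.0}
R_s = 0.5                                     # made-up immediate reward at s

q_a1 = 0.7 * V_t["s1"] + 0.3 * V_t["s4"]      # expectation under action a1 = 1.4
q_a2 = 0.4 * V_t["s2"] + 0.6 * V_t["s3"]      # expectation under action a2 = 2.6
V_t1_s = R_s + max(q_a1, q_a2)                # Bellman backup: 0.5 + 2.6 = 3.1
```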

Value Iteration: Finite Horizon Case The Markov property allows exploitation of the DP principle for optimal policy construction: no need to enumerate the |A|^{Hn} possible policies. Value Iteration: V^0(s) = R(s), for all s; V^k(s) = R(s) + max_a Σ_{s'} T(s,a,s') V^{k-1}(s') (the Bellman backup); π*(s,k) = argmax_a Σ_{s'} T(s,a,s') V^{k-1}(s'). V^k is the optimal k-stage-to-go value function; π*(s,k) is the optimal k-stage-to-go policy.
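
A minimal finite-horizon value iteration sketch under the same assumed numpy layout (T[a, s, s'], R[s]); function and variable names are illustrative.

```python
import numpy as np

def value_iteration_finite(T, R, H):
    """Finite-horizon value iteration via Bellman backups.

    Returns:
      V: (H + 1, n) array, V[k, s] = optimal k-stage-to-go value
      pi: (H + 1, n) integer array, pi[k, s] = optimal action with k stages-to-go
    """
    _, n, _ = T.shape
    V = np.zeros((H + 1, n))
    pi = np.zeros((H + 1, n), dtype=int)
    V[0] = R                                    # V^0(s) = R(s)
    for k in range(1, H + 1):
        Q = T @ V[k - 1]                        # Q[a, s] = expected next-stage value of a in s
        V[k] = R + Q.max(axis=0)                # Bellman backup
        pi[k] = Q.argmax(axis=0)                # optimal k-stage-to-go policy
    return V, pi
```

Storing the argmax at each stage gives the nonstationary optimal policy π*(s,k) alongside the value function, and the loop structure matches the O(H·|A|·n^2) cost discussed a few slides later.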

Value Iteration [diagram: columns V^3, V^2, V^1, V^0 over states s1-s4, with the same transition probabilities (0.7/0.3 and 0.4/0.6) repeated between successive stages] V^1(s4) = R(s4) + max{ 0.7 V^0(s1) + 0.3 V^0(s4), 0.4 V^0(s2) + 0.6 V^0(s3) }

Value Iteration [diagram: same stage-by-stage backup picture as the previous slide] π*(s4,t) = argmax over the same two expectations, i.e. the action achieving the max in the backup above.

Value Iteration: Complexity Note how DP is used: the optimal solution to the (k-1)-stage problem can be used without modification as part of the optimal solution to the k-stage problem. What is the computational complexity? H iterations; at each iteration, each of the n states computes an expectation for each of the |A| actions; each expectation takes O(n) time. Total time complexity: O(H·|A|·n^2). Polynomial in the number of states. Is this good?

Summary: Finite Horizon The resulting policy is optimal: V^k_{π*}(s) >= V^k_π(s) for all π, s, k. Convince yourself of this (use induction on k). Note: the optimal value function is unique, but the optimal policy is not (there can be ties between actions); many policies can have the same value.

Discounted Infinite Horizon MDPs Defining value as total reward is problematic with infinite horizons: many or all policies have infinite expected reward (some MDPs are ok, e.g., zero-cost absorbing states). Trick: introduce a discount factor 0 <= β < 1; future rewards are discounted by β per time step. Note: V_π(s) = E_π[ Σ_{t=0}^∞ β^t R^t | s ] <= E_π[ Σ_{t=0}^∞ β^t R_max ] = R_max / (1-β), so the value is bounded. Motivation: economic? probability of death? convenience?

Notes: Discounted Infinite Horizon Optimal policies are guaranteed to exist (Howard, 1960), i.e. there is a policy that maximizes value at each state. Furthermore, there is always an optimal stationary policy. Intuition: why would we change the action at s at a new time when there is always forever ahead? We define V*(s) = V_π(s) for some optimal stationary policy π.

Policy Evaluation Value equation for a fixed policy: immediate reward + expected discounted future reward, V_π(s) = R(s) + β Σ_{s'} T(s, π(s), s') V_π(s') (derive this from the original definition). How can we compute the value function for a policy? We are given R and T, and the value equation gives a linear system with n variables and n constraints. Variables are the values of the states: V(s1), ..., V(sn). Constraints: one value equation (above) per state. Use linear algebra to solve for V_π (e.g. matrix inverse).

Policy Evaluation via Matrix Inverse V_π and R are n-dimensional column vectors (one element for each state). T is an n x n matrix such that T(i,j) = T(s_i, π(s_i), s_j). Then V_π = R + βT V_π, so (I - βT) V_π = R, and V_π = (I - βT)^{-1} R.
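
A minimal numpy sketch of this exact evaluation under the same assumed layout; it solves the linear system (I - βT)V = R directly rather than forming an explicit inverse. The function name and layout are assumptions for illustration.

```python
import numpy as np

def evaluate_policy_exact(T, R, pi, beta):
    """Exact evaluation of a stationary policy for a discounted MDP.

    T: (m, n, n) array, T[a, s, s'] = Pr(s' | s, a)
    R: (n,) array of state rewards
    pi: (n,) integer array, pi[s] = action taken in s
    beta: discount factor in [0, 1)
    Returns V_pi of shape (n,) satisfying V = R + beta * T_pi V.
    """
    n = R.shape[0]
    T_pi = T[pi, np.arange(n)]                 # (n, n): row s is Pr(. | s, pi(s))
    # Solve (I - beta * T_pi) V = R instead of inverting the matrix.
    return np.linalg.solve(np.eye(n) - beta * T_pi, R)
```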

Computing an Optimal Value Function Bellman equation for the optimal value function: V*(s) = R(s) + β max_a Σ_{s'} T(s,a,s') V*(s'). Bellman proved this is always true for an optimal value function. How can we compute the optimal value function? The MAX operator makes the system non-linear, so the problem is more difficult than policy evaluation. Notice that the optimal value function is a fixed-point of the Bellman backup operator B (i.e. B[V*] = V*). B takes a value function as input and returns a new value function: B[V](s) = R(s) + β max_a Σ_{s'} T(s,a,s') V(s').

Value Iteration We can compute the optimal policy using value iteration based on Bellman backups, just like finite-horizon problems (but include the discount term): V^0(s) = 0; V^k(s) = R(s) + β max_a Σ_{s'} T(s,a,s') V^{k-1}(s'). Will it converge to the optimal value function as k gets large? Yes: lim_{k→∞} V^k = V*. When should we stop in practice?

Convergence: Contraction Property B[V] is a contraction operator on value functions. That is, the operator B satisfies: for any V and V', ||B[V] - B[V']|| <= β ||V - V'||. Here ||V|| is the max-norm, which returns the maximum element of the vector, e.g. ||(0.1, 100, 5, 6)|| = 100. So applying a Bellman backup to any two value functions causes them to get closer together in the max-norm sense!

Convergence Using the contraction property we can prove convergence of value iteration. Proof: 1. For any V: ||V* - B[V]|| = ||B[V*] - B[V]|| <= β ||V* - V||. 2. So applying a Bellman backup to any value function V brings us closer to V* by a constant factor β in the max-norm sense. 3. This means that ||V^k - V*|| <= β^k ||V* - V^0||. 4. Thus lim_{k→∞} V^k = V*.

Stopping Condition We want to stop when we can guarantee the value function is near optimal. Key property: if ||V^k - V^{k-1}|| <= ε then ||V^k - V*|| <= εβ/(1-β). You'll show this in your homework. Continue iterating until ||V^k - V^{k-1}|| <= ε; select a small enough ε for the desired error guarantee.
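
Putting the discounted Bellman backup and this stopping rule together, a minimal sketch under the same assumed T[a, s, s'] / R[s] layout; the default epsilon is just an example value.

```python
import numpy as np

def value_iteration_discounted(T, R, beta, eps=1e-6):
    """Discounted value iteration with the max-norm stopping condition.

    Stops when ||V^k - V^{k-1}|| <= eps, which guarantees
    ||V^k - V*|| <= eps * beta / (1 - beta).
    """
    _, n, _ = T.shape
    V = np.zeros(n)                            # V^0(s) = 0
    while True:
        Q = R + beta * (T @ V)                 # Q[a, s] = backed-up value of a in s
        V_new = Q.max(axis=0)                  # Bellman backup B[V]
        if np.max(np.abs(V_new - V)) <= eps:   # max-norm stopping condition
            return V_new
        V = V_new
```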

How to Act Given a V^k from value iteration that closely approximates V*, what should we use as our policy? Use the greedy policy (one-step lookahead): greedy[V^k](s) = argmax_a Σ_{s'} T(s,a,s') V^k(s'). Note that the value of the greedy policy may not be equal to V^k. Let V^g be the value of the greedy policy. How close is V^g to V*? I.e. how close is our greedy policy to optimal after k iterations?

How to Act Given a V^k that closely approximates V*, what should we use as our policy? Use the greedy policy (one-step lookahead): greedy[V^k](s) = argmax_a Σ_{s'} T(s,a,s') V^k(s'). This selects the action that looks best if we assume that we get value V^k in one step. How good is this policy?

Value of Greedy Policy greedy[V^k](s) = argmax_a Σ_{s'} T(s,a,s') V^k(s'). Define V^g to be the value of this greedy policy; this is likely not the same as V^k (convince yourself of this). Property: if ||V^k - V*|| <= λ then ||V^g - V*|| <= 2λβ/(1-β). Thus, V^g is not too far from optimal if V^k is close to optimal. Set the stopping condition so that V^g has the desired accuracy. Furthermore, there is a finite ε such that the greedy policy is optimal; that is, even if the value estimate is off, the greedy policy is optimal once it is close enough. Why?
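
A minimal sketch of extracting the greedy one-step-lookahead policy from a value function, under the same assumed array layout:

```python
import numpy as np

def greedy_policy(T, V):
    """One-step-lookahead greedy policy with respect to a value function V.

    T: (m, n, n) array, T[a, s, s'] = Pr(s' | s, a)
    V: (n,) value function (e.g. the output of value iteration)
    Returns pi of shape (n,), pi[s] = argmax_a sum_{s'} T(s, a, s') V(s').
    """
    return (T @ V).argmax(axis=0)
```

Combined with the value iteration sketch above, greedy_policy(T, V) turns an approximate value function into an executable stationary policy.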

Optimization via Policy Iteration Recall that, given a policy, we can compute its value exactly: V_π(s) = R(s) + β Σ_{s'} T(s, π(s), s') V_π(s'). Policy iteration exploits this: it iterates steps of policy evaluation and policy improvement. 1. Choose a random policy π. 2. Loop: (a) evaluate V_π; (b) for each s in S, set π'(s) = argmax_a Σ_{s'} T(s,a,s') V_π(s') (policy improvement); (c) replace π with π'. Until no improving action is possible at any state.
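
A minimal policy iteration sketch that alternates the exact evaluation step with greedy improvement (same assumed layout; the deterministic argmax tie-breaking is an implementation choice, not from the slides):

```python
import numpy as np

def policy_iteration(T, R, beta):
    """Policy iteration: alternate exact policy evaluation and greedy improvement.

    T: (m, n, n) array, T[a, s, s'] = Pr(s' | s, a); R: (n,) rewards; beta in [0, 1).
    """
    _, n, _ = T.shape
    pi = np.zeros(n, dtype=int)                     # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - beta * T_pi) V = R exactly.
        T_pi = T[pi, np.arange(n)]
        V = np.linalg.solve(np.eye(n) - beta * T_pi, R)
        # Policy improvement: act greedily with respect to V_pi.
        pi_new = (T @ V).argmax(axis=0)
        if np.array_equal(pi_new, pi):              # no improving action at any state
            return pi, V
        pi = pi_new
```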

Policy Iteration: Convergence Policy improvement guarantees that π' is no worse than π. Further, if π is not optimal then π' is strictly better in at least one state. Local improvements lead to global improvement! For a proof sketch see http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node42.html; I'll walk you through a proof in a HW problem. Convergence is assured: there are no local maxima in value space (i.e. an optimal policy exists), and since there are a finite number of policies and each step improves value, PI must converge to optimal. It gives the exact value of the optimal policy.

Policy Iteration Complexity Each iteration runs in polynomial time in the number of states and actions. There are at most |A|^n policies and PI never repeats a policy, so there are at most an exponential number of iterations. Not a very good complexity bound. Empirically, O(n) iterations are required. Challenge: try to generate an MDP that requires more than n iterations. Still no polynomial bound on the number of PI iterations (open problem)! But maybe not anymore...

Value Iteration vs. Policy Iteration Which is faster, VI or PI? It depends on the problem. VI takes more iterations than PI, but PI requires more time on each iteration: PI must perform policy evaluation on each iteration, which involves solving a linear system. VI is easier to implement since it does not require the policy evaluation step. We will see that both algorithms serve as inspiration for more advanced algorithms.

Recap: things you should know What is an MDP? What is a policy (stationary and non-stationary)? What is a value function (finite-horizon and infinite-horizon)? How do we evaluate policies (finite-horizon and infinite-horizon), and what is the time/space complexity? How do we optimize policies (finite-horizon and infinite-horizon), what is the time/space complexity, and why are the algorithms correct?