Factored State Spaces 3/2/17


Converting POMDPs to MDPs. In a POMDP, each action and observation updates the agent's belief, and value is a function of that belief. Instead we can view this as an MDP where: there is a state for every possible belief; beliefs are probability distributions, so the belief-states form a continuum and there are infinitely many of them; taking an action transitions to another belief-state; and because observations are random, this transition is random.
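
The belief-state transition can be written out directly. Below is a minimal sketch (not from the slides) of the standard Bayes-filter update for a discrete POMDP; T, O, and b are hypothetical transition, observation, and belief arrays.

import numpy as np

def belief_update(b, a, o, T, O):
    """One belief-state transition in the belief MDP.

    b: current belief over states, shape (|S|,)
    a: action taken
    o: observation received
    T: transition probabilities, T[a, s, s_next] = P(s_next | s, a)
    O: observation probabilities, O[a, s_next, o] = P(o | s_next, a)
    """
    # Predict: push the belief through the transition model.
    predicted = b @ T[a]                 # shape (|S|,)
    # Correct: weight by the likelihood of the observation received.
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()

Each (action, observation) pair maps one point in the belief simplex to another, which is exactly the transition structure of the belief MDP described above.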

Value Iteration in POMDPs. Value iteration in a finite MDP:
1. Initialize each state's value to 0.
2. Compute the greedy policy for each state.
3. Update the value of each state based on this policy.
4. Go to step 2; repeat until converged.
In a POMDP there are infinitely many belief-states, so we can't loop through them. The value function is, however, a piecewise-linear function of the belief, so we can do value iteration over a finite set of linear functions. For a description of the algorithms, see pomdp.org.
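
For reference, here is a minimal sketch of the finite-MDP loop described above (not code from the lecture; T, R, and gamma are hypothetical transition, reward, and discount inputs).

import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Finite-MDP value iteration.

    T: transitions, T[a, s, s_next] = P(s_next | s, a)
    R: rewards, R[s, a]
    gamma: discount factor
    """
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)                              # step 1: initialize values
    while True:
        # steps 2-3: greedy backup of every state's value
        Q = R + gamma * np.einsum('ast,t->sa', T, V)    # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:             # step 4: repeat until converged
            return V_new, Q.argmax(axis=1)              # values and greedy policy
        V = V_new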

POMDP value iteration is impractical. In principle, we can iterate over a finite set of linear functions to update values. In practice, this gets out of hand very quickly. This example is from a 2-state, 2-action POMDP, computing values at horizon 3.
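
To see why it blows up, consider the standard counting argument for exact POMDP value iteration (a well-known bound, not stated on the slide). If Γ_t is the set of linear value functions (alpha vectors) at horizon t, A the actions, and Ω the observations, one exact backup can produce up to |A| · |Γ_t|^|Ω| vectors before pruning, so the set can grow doubly exponentially with the horizon.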

What can we do instead? Approximation: don't find optimal values for every state; instead solve for pretty-good values for groups of states. Online planning: don't bother devising a complete policy; instead come up with a pretty-good policy for the short term, and re-plan in the future. Reinforcement learning: combines aspects of approximation and online planning. We'll focus on RL after spring break.

What is the running time of Value Iteration on MDPs?
repeat until convergence:            (update values table)
    for each state:                  (update state's value)
        for each action:             (find the best action)
            for each next state:     (update expected value)
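
Counting the nested loops gives the per-sweep cost (a standard observation, not written on the slide): each pass over the values table touches every (state, action, next state) triple, so one sweep costs on the order of |S| · |A| · |S| = |S|² · |A| updates.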

Running time of Value Iteration. Value iteration isn't fast, but if the state space is small, it's manageable. The problem is that state spaces can be exponentially larger than they seem. If the agent's state has k variables with domains D_1, ..., D_k to keep track of, there can be |D_1| × |D_2| × ... × |D_k| distinct states. Even in the best case (binary variables), this is exponential (2^k states).
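
A quick back-of-the-envelope illustration (the numbers are made up, not from the slides): with k = 20 binary variables there are 2^20 ≈ 1,000,000 states, so a single value-iteration sweep already costs on the order of |S|² · |A| ≈ 10^12 · |A| basic updates, far more than we can afford to run to convergence.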

How big is the state space? In gridworld, we can track the state with two variables: row and column. This is quite manageable. How big is the state space for realistic MDPs? Consider a search-and-rescue robot or a Mars rover: what variables do these agents need to keep track of?

Mars rover state space. Some possible state-space variables:
- position: x, y, z (large domains)
- environmental factors: sunlight, wind, temperature
- robot controls: battery level, arm position, solar panel angle, drill operation
- mission objectives: current experiment, samples being analyzed
- etc.
This state space is gigantic! (A rough size calculation is sketched below.)
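
As a rough illustration (the variables and domain sizes here are invented, not from the lecture), representing the rover's state as a tuple of variables makes the combinatorial blow-up easy to compute.

from math import prod

# Hypothetical domain sizes for a factored Mars-rover state.
domain_sizes = {
    "x": 1000, "y": 1000, "z": 100,        # position on a coarse grid
    "sunlight": 10, "wind": 10, "temperature": 50,
    "battery_level": 100,
    "arm_position": 20,
    "panel_angle": 36,
    "drill_on": 2,
    "current_experiment": 8,
    "samples_analyzed": 16,
}

# The number of joint states is the product of the individual domain sizes.
num_states = prod(domain_sizes.values())
print(f"{num_states:.2e} distinct states")   # ~ 9.2e+18

Even with these modest domain sizes, the joint state space is far too large to enumerate, which is exactly the point of the slide.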

MDPs vs. State Space Search. In state space search, we might also have state variables and factored state spaces. Why didn't we worry about the fact that the state space could be exponentially large then?

Handling huge state spaces. Full value iteration is infeasible. We could try to reason about the factored state space directly, planning in terms of variables instead of states; one could spend an entire month on this topic (we won't). Instead we'll look at approximation, online planning, and reinforcement learning.

Online Planning. Key idea: plan for one state at a time. We may be able to come up with a good action for the current state without solving the entire MDP. We can plan ahead some, but we know that we're going to stop and think about plans again later. Where have we seen something like this before?

Thinking about online planning. How can we use ideas we've already seen to help with online planning? Heuristics? Iterative deepening? Monte Carlo simulations? Other ideas?
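
One concrete way to combine these ideas is depth-limited Monte Carlo rollouts from the current state only. This is a generic sketch of that idea, not an algorithm from the slides; sample_next and reward stand in for a hypothetical simulator of the MDP.

import random

def plan_online(state, actions, sample_next, reward,
                gamma=0.95, depth=10, rollouts=100):
    """Pick an action for the current state by averaging random rollouts."""
    def rollout(s, a):
        total, discount = 0.0, 1.0
        for _ in range(depth):                 # plan ahead only a little
            s_next = sample_next(s, a)         # simulator, not the real world
            total += discount * reward(s, a, s_next)
            discount *= gamma
            s, a = s_next, random.choice(actions)
        return total

    # Estimate each action's value by Monte Carlo simulation, then act greedily.
    estimates = {a: sum(rollout(state, a) for _ in range(rollouts)) / rollouts
                 for a in actions}
    return max(estimates, key=estimates.get)

The agent would call plan_online again at whatever state it actually reaches next, re-planning as it goes rather than committing to a complete policy.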