CSE 473: Artificial Intelligence - Reinforcement Learning (5/7/2012)
Today's Outline: Recap: MDPs; Bellman Equations; Q-Value Iteration; Bellman Backup


CSE 473: Artificial Intelligence: Reinforcement Learning (Dan Weld)

Today's Outline
- Reinforcement learning
- Q-value iteration
- Q-learning
- Exploration / exploitation
- Linear function approximation
Many slides adapted from Dan Klein, Stuart Russell, Luke Zettlemoyer, or Andrew Moore.

Recap: MDPs
Markov decision processes:
- States S
- Actions A
- Transitions T(s,a,s'), aka P(s'|s,a)
- Rewards R(s,a,s') (and discount γ)
- Start state s_0 (or distribution P_0)
Algorithms: value iteration, Q-value iteration.
Quantities:
- Policy = map from states to actions
- Utility = sum of discounted future rewards
- Q-value = expected utility from a q-state, i.e. from a state/action pair (s,a)
(Slide art: the s / (s,a) / s' backup diagram and a portrait of Andrey Markov, 1856-1922.)

Bellman Equations
  Q*(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q*(s',a') ]

Bellman Backup
Worked example from the slide: with V_4(s_1) = 0, V_4(s_2) = 1, V_4(s_3) = 2,
- Q_5(s,a_1) = 2 + 0 ≈ 2
- Q_5(s,a_2) = 5 + 0.9·1 + 0.1·2 ≈ 6.1
- Q_5(s,a_3) = 4.5 + 2 ≈ 6.5
so V_5(s) = max_a Q_5(s,a) = 6.5.

Q-Value Iteration
Regular value iteration: find successive approximations of the optimal values.
- Start with V_0*(s) = 0
- Given V_i*, calculate the values for all states at depth i+1:
    V_{i+1}(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
Storing Q-values is more useful!
- Start with Q_0*(s,a) = 0
- Given Q_i*, calculate the q-values for all q-states at depth i+1:
    Q_{i+1}(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q_i(s',a') ]
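To sanity-check the Bellman backup example above, here is a minimal sketch; the split of each Q-value into per-transition rewards and probabilities is my reading of the slide figure, and γ = 1 is assumed.

```python
# Minimal sketch reproducing the Bellman backup numbers on the slide.
# The reward/transition decomposition is an assumed reading of the figure; gamma = 1.

gamma = 1.0
V4 = {"s1": 0.0, "s2": 1.0, "s3": 2.0}           # depth-4 values of the successor states

# Each action maps to (probability, next_state, reward) outcomes.
actions = {
    "a1": [(1.0, "s1", 2.0)],
    "a2": [(0.9, "s2", 5.0), (0.1, "s3", 5.0)],  # 5 + 0.9*1 + 0.1*2 = 6.1
    "a3": [(1.0, "s3", 4.5)],                    # 4.5 + 2 = 6.5
}

Q5 = {a: sum(p * (r + gamma * V4[s2]) for p, s2, r in outcomes)
      for a, outcomes in actions.items()}
print(Q5)                # approximately {'a1': 2.0, 'a2': 6.1, 'a3': 6.5}
print(max(Q5.values()))  # 6.5 = V_5(s)
```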

Q-Value Iteration
- Initialize each q-state: Q_0(s,a) = 0
- Repeat: for all q-states (s,a), compute Q_{i+1}(s,a) from Q_i by a Bellman backup at (s,a):
    Q_{i+1}(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q_i(s',a') ]
- Until max_{s,a} |Q_{i+1}(s,a) - Q_i(s,a)| < ε

Reinforcement Learning
Markov decision processes:
- States S
- Actions A
- Transitions T(s,a,s'), aka P(s'|s,a)
- Rewards R(s,a,s') (and discount γ)
- Start state s_0 (or distribution P_0)
Algorithms: Q-value iteration, Q-learning.
Approaches for mixing exploration and exploitation: ε-greedy, exploration functions.

Applications
- Robotic control: helicopter maneuvering, autonomous vehicles
- Mars rover: path planning, oversubscription planning
- Elevator planning
- Game playing: backgammon, tetris, checkers
- Neuroscience
- Computational finance, sequential auctions
- Assisting the elderly in simple tasks
- Spoken dialog management
- Communication networks: switching, routing, flow control
- War planning, evacuation planning
Stanford Autonomous Helicopter: http://heli.stanford.edu/

Two Main Reinforcement Learning Approaches
Model-based approaches:
- Explore the environment and learn a model, T = P(s'|s,a) and R(s,a), (almost) everywhere.
- Use the model to plan a policy, MDP-style.
- Leads to the strongest theoretical results.
- Often works well when the state space is manageable.
Model-free approaches:
- Don't learn a model; learn a value function or policy directly.
- Weaker theoretical results.
- Often works better when the state space is large.
Parameter counts: model-based learns T + R, roughly S²·A + S·A parameters (e.g. 40,000); model-free learns Q, roughly S·A parameters (e.g. 400).
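Here is a minimal sketch of this loop, assuming the MDP is passed in as a nested dictionary T[s][a] = list of (probability, next_state, reward) triples; the names and data layout are illustrative, not from the slides.

```python
def q_value_iteration(T, gamma=0.9, eps=1e-6):
    """Tabular Q-value iteration over an MDP given as T[s][a] = [(p, s_next, r), ...]."""
    Q = {s: {a: 0.0 for a in T[s]} for s in T}          # Q_0(s, a) = 0
    while True:
        Q_new = {}
        for s in T:
            Q_new[s] = {}
            for a in T[s]:
                # Bellman backup at (s, a)
                Q_new[s][a] = sum(p * (r + gamma * max(Q[s2].values(), default=0.0))
                                  for p, s2, r in T[s][a])
        delta = max(abs(Q_new[s][a] - Q[s][a]) for s in T for a in T[s])
        Q = Q_new
        if delta < eps:          # stop when max_{s,a} |Q_{i+1} - Q_i| < eps
            return Q
```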

Recap: Sampling Expectations
- Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x)
- Model-based: estimate P(x) from samples, then compute the expectation.
- Model-free: estimate the expectation directly from samples.
- Why does this work? Because samples appear with the right frequencies!

Recap: Exponential Moving Average
- Exponential moving average: mean_n = (1 - α) · mean_{n-1} + α · x_n
- Makes recent samples more important.
- Forgets about the past (distant past values were wrong anyway).
- Easy to compute from the running average.
- A decreasing learning rate can give converging averages.

Q-Learning Update
Q-learning = sample-based Q-value iteration. How to learn the Q*(s,a) values?
- Receive a sample (s, a, s', r).
- Consider your old estimate: Q(s,a).
- Consider your new sample estimate: sample = r + γ max_a' Q(s',a').
- Incorporate the new estimate into a running average: Q(s,a) ← (1 - α) Q(s,a) + α · sample.

Exploration-Exploitation Tradeoff
- You have visited part of the state space and found a reward of 100. Is this the best you can hope for?
- Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge? At the risk of missing out on a better reward somewhere.
- Exploration: should I look for states with more reward? At the risk of wasting time and collecting some negative reward.

Q-Learning: ε-Greedy Exploration
Several schemes for action selection. Simplest: random actions (ε-greedy):
- Every time step, flip a coin.
- With probability ε, act randomly.
- With probability 1 - ε, act according to the current policy.
Problems with random actions? You do explore the space, but you keep thrashing around once learning is done.
- One solution: lower ε over time.
- Another solution: exploration functions.
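A minimal sketch of tabular Q-learning with ε-greedy action selection follows. It assumes a gym-style environment object with reset() and step() methods, which is not part of the slides; everything else mirrors the running-average update above.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sketch: env.reset() -> s and env.step(a) -> (s_next, reward, done) are assumed."""
    Q = defaultdict(float)                        # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:         # with probability epsilon: act randomly
                a = random.choice(actions)
            else:                                 # otherwise: act greedily w.r.t. current Q
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # sample-based Bellman backup, folded in as an exponential running average
            sample = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s2
    return Q
```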

Exploration Functions
- When to explore? Random actions explore a fixed amount. Better idea: explore areas whose badness is not (yet) established.
- Exploration function: takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n (the exact form is not important).
- Exploration policy: π(s) = argmax_a f(Q(s,a), N(s,a)), vs. the plain greedy argmax_a Q(s,a).

Q-Learning: Final Solution
Q-learning produces tables of q-values.

Q-Learning Properties
- Amazing result: Q-learning converges to the optimal policy
  - if you explore enough,
  - if you make the learning rate small enough (but don't decrease it too quickly!),
  - and it is not too sensitive to how you select actions (!).
- Neat property: off-policy learning; you learn the optimal policy without following it (some caveats).

Q-Learning: Small Problem
- It doesn't work in realistic situations: we can't possibly learn about every single state!
  - Too many states to visit them all in training.
  - Too many states to hold the q-tables in memory.
- Instead, we need to generalize:
  - Learn about a few states from experience.
  - Generalize that experience to new, similar states (a fundamental idea in machine learning).

Example: Pacman
Let's say we discover through experience that this state is bad. In naive Q-learning, we know nothing about related states and their Q-values, or even a third, very similar one.

Feature-Based Representations
- Solution: describe a state using a vector of features (properties).
- Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
- Example features: distance to the closest ghost, distance to the closest dot, number of ghosts, 1 / (distance to dot)², is Pacman in a tunnel? (0/1), etc.
- Can also describe a q-state (s, a) with features (e.g. "action moves closer to food").
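A small sketch of this idea, using f(u, n) = u + k/n as the optimistic utility (one common choice; as the slide notes, the exact form is not important). Q and N are assumed to be tables of q-value estimates and visit counts keyed by (s, a).

```python
from collections import defaultdict

def exploration_value(q, n, k=1.0):
    """Optimistic utility f(u, n) = u + k/n; unvisited q-states look best of all."""
    return q + k / n if n > 0 else float("inf")

def select_action(Q, N, s, actions, k=1.0):
    """Pick argmax_a f(Q(s,a), N(s,a)) instead of the plain greedy argmax_a Q(s,a)."""
    return max(actions, key=lambda a: exploration_value(Q[(s, a)], N[(s, a)], k))

# Usage: Q holds q-value estimates, N holds visit counts for each q-state.
Q, N = defaultdict(float), defaultdict(int)
a = select_action(Q, N, "s0", ["left", "right"])
N[("s0", a)] += 1
```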

Linear Feature Functions
Using a feature representation, we can write a q-function (or value function) for any state as a linear combination of a few weights:
  Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
- Advantage: our experience is summed up in a few powerful numbers (n weights instead of an S·A table).
- Disadvantage: states may share features but actually be very different in value!

Function Approximation
Q-learning with linear q-functions:
- On a transition (s, a, r, s'), compute the error:
    difference = [ r + γ max_a' Q(s',a') ] - Q(s,a)
- Exact Q's: Q(s,a) ← Q(s,a) + α · difference
- Approximate Q's: w_i ← w_i + α · difference · f_i(s,a)
- Intuitive interpretation: adjust the weights of the active features. E.g. if something unexpectedly bad happens, disprefer all states with that state's features.
- Formal justification: online least squares.

Example: Q-Pacman; Linear Regression
(Figures: a worked Q-Pacman update and scatter plots illustrating linear regression.)

Ordinary Least Squares (OLS): Minimizing Error
Imagine we had only one point x with features f(x). The error (residual) is the gap between the observation y and the prediction Σ_k w_k f_k(x), and least squares adjusts the weights to reduce the squared error. The approximate q-update has exactly this online form, with target r + γ max_a' Q(s',a') and prediction Q(s,a).
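A minimal sketch of the approximate Q-learning update with a linear q-function; features(s, a) is an assumed feature extractor returning a dict of named feature values, not something defined on the slides.

```python
def q_value(w, feats):
    """Q(s,a) = w · f(s,a) for a weight dict w and a feature dict feats."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(w, s, a, r, s2, actions, features, alpha=0.05, gamma=0.9):
    """One approximate Q-learning step: adjust the weights of the active features."""
    feats = features(s, a)
    # difference = [ r + gamma * max_a' Q(s',a') ] - Q(s,a)
    target = r + gamma * max(q_value(w, features(s2, a2)) for a2 in actions)
    difference = target - q_value(w, feats)
    for name, value in feats.items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w
```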

Overfitting
(Figure: a degree-15 polynomial fit through a handful of data points, illustrating overfitting.)

Which Algorithm?
Three demo clips compare:
- Q-learning, no features, 50 learning trials
- Q-learning, no features, 1000 learning trials
- Q-learning, simple features, 50 learning trials

Partially Observable MDPs
A POMDP: Ghost Hunter.
Markov decision processes:
- States S
- Actions A
- Transitions P(s'|s,a) (or T(s,a,s'))
- Rewards R(s,a,s') (and discount γ)
- Start state distribution b_0 = P(s_0)
POMDPs just add:
- Observations O
- Observation model P(o|s,a) (or O(s,a,o))
(Slide art: the belief-state backup diagram b, (b,a), o, b'.)
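For concreteness, a minimal container for the POMDP ingredients just listed might look as follows; this is an illustrative sketch, not a library API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    states: List[str]
    actions: List[str]
    observations: List[str]
    T: Dict[Tuple[str, str, str], float]   # T[(s, a, s')] = P(s' | s, a)
    R: Dict[Tuple[str, str, str], float]   # R[(s, a, s')]
    O: Dict[Tuple[str, str, str], float]   # O[(s, a, o)]  = P(o | s, a)
    b0: Dict[str, float] = field(default_factory=dict)   # start state distribution
    gamma: float = 0.95                                   # discount
```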

POMDP Computations
- Sufficient statistic: the belief state b, the posterior distribution over states given the action/observation history.
- POMDP search trees: max nodes are belief states; expectation nodes branch on the possible observations. (This is motivational; we will not discuss it in detail.)

Types of Planning Problems
- Classical planning: state observable; action model deterministic and accurate.
- MDPs: state observable; action model stochastic.
- POMDPs: state partially observable; action model stochastic.

Classical Planning vs. MDP-Style Planning
- Classical planning: world deterministic, state observable.
- MDP-style planning: world stochastic, state observable; the solution is a policy (a universal plan, or navigation function).
(Slide figures: grid worlds containing a "hell" cell and a start cell.)

Stochastic, Partially Observable
(Slide figures: two grid-world configurations, each drawn with 50% probability, in which the location of the "hell" cell differs; the agent starts without knowing which configuration it is in.)

(Two more slides continue the stochastic, partially observable grid-world example.)

Notation (1)
Recall the Bellman optimality equation:
  V(s) = max_{a ∈ A(s)} Σ_s' P^a_ss' [ R^a_ss' + γ V(s') ]
Throughout this section we assume that R^a_ss' = r(s,a) is independent of s', so that the Bellman optimality equation turns into
  V(s) = max_{a ∈ A(s)} [ r(s,a) + γ Σ_s' P^a_ss' V(s') ]

Notation (2)
In the remainder we will use a slightly different notation for this equation:
  V(x) = max_u [ r(x,u) + γ ∫ V(x') p(x'|x,u) dx' ]
Compared to the previously used notation, we replaced s by x and a by u, and turned the sum into an integral.

Value Iteration
Given this notation, the value iteration formula is
  V_T(x) = max_u [ r(x,u) + γ ∫ V_{T-1}(x') p(x'|x,u) dx' ]
with V_0(x) = 0.

POMDPs
In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the current state. POMDPs compute a value function over belief space:
  V_T(b) = max_u [ r(b,u) + γ ∫ V_{T-1}(b') p(b'|b,u) db' ]
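A minimal sketch of finite-horizon value iteration in the r(x, u) notation above, for a finite state space (so the integral becomes a sum); the data layout is assumed, not taken from the slides.

```python
def value_iteration(states, actions, P, R, gamma=1.0, horizon=10):
    """V_T(x) = max_u [ r(x,u) + gamma * sum_x' p(x'|x,u) V_{T-1}(x') ].

    P[(x, u)] is a dict {x_next: probability}; R[(x, u)] is the immediate payoff r(x, u).
    """
    V = {x: 0.0 for x in states}                 # V_0(x) = 0
    for _ in range(horizon):
        V = {x: max(R[(x, u)] + gamma * sum(p * V[x2] for x2, p in P[(x, u)].items())
                    for u in actions)
             for x in states}
    return V
```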

Problems
- Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
- This is problematic, since probability distributions are continuous.
- Additionally, we have to deal with the huge complexity of belief spaces.
- For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.

An Illustrative Example
Two states x_1 and x_2, two measurements z_1 and z_2, and actions u_1, u_2, u_3.
(Slide figure: the transition, measurement, and payoff structure. The sensing action u_3 switches the state with probability 0.8 and leaves it unchanged with probability 0.2; the measurement model is p(z_1|x_1) = 0.7, p(z_2|x_1) = 0.3, p(z_1|x_2) = 0.3, p(z_2|x_2) = 0.7; the payoffs are -100 and +100 for u_1 in x_1 and x_2, and +100 and -50 for u_2.)

The Parameters of the Example
- The actions u_1 and u_2 are terminal actions.
- The action u_3 is a sensing action that potentially leads to a state transition.
- The horizon is finite and γ = 1.

Payoff in POMDPs
In MDPs the payoff (or return) depended on the state of the system. In POMDPs, however, the true state is not exactly known. Therefore, we compute the expected payoff by integrating over all states:
  r(b,u) = E_x[ r(x,u) ] = ∫ r(x,u) b(x) dx

Payoffs in Our Example (1)
If we are totally certain that we are in state x_1 and execute action u_1, we receive a reward of -100.

Payoffs in Our Example (2)
If, on the other hand, we definitely know that we are in x_2 and execute u_1, the reward is +100. In between, it is the linear combination of the extreme values, weighted by their probabilities:
  r(b,u_1) = -100 · b(x_1) + 100 · b(x_2)
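The expected payoff for the two-state example can be computed directly; the payoff numbers below are my reading of the example figure (the text itself only confirms -100 and +100 for u_1).

```python
# Expected payoff r(b, u) for the two-state example; payoff table assumed from the figure.
r = {("x1", "u1"): -100, ("x2", "u1"): +100,
     ("x1", "u2"): +100, ("x2", "u2"): -50}

def expected_payoff(p1, u):
    """p1 = b(x1); the belief over the two states is (p1, 1 - p1)."""
    return p1 * r[("x1", u)] + (1 - p1) * r[("x2", u)]

print(expected_payoff(1.0, "u1"))   # -100: certain we are in x1
print(expected_payoff(0.0, "u1"))   # +100: certain we are in x2
print(expected_payoff(0.5, "u1"))   # 0: linear interpolation in between
```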

The Resulting Policy for T=1
Given a finite POMDP with T=1, we would use V_1(b) to determine the optimal policy. In our example, the optimal policy for T=1 picks whichever terminal action has the higher expected payoff: π_1(b) = argmax_u r(b,u).

Piecewise Linearity, Convexity
The resulting value function V_1(b) is the maximum of the three payoff functions at each point. This is the upper thick graph in the diagram. It is piecewise linear and convex.

Pruning
If we carefully consider V_1(b), we see that only the first two components contribute. The third component can therefore safely be pruned away from V_1(b).

Increasing the Time Horizon
If we go over to a time horizon of T=2, the agent can also consider the sensing action u_3. Suppose we perceive z_1, for which p(z_1|x_1) = 0.7 and p(z_1|x_2) = 0.3. Given the observation z_1, we update the belief using Bayes' rule:
  b'(x_1) = p(z_1|x_1) b(x_1) / p(z_1), with p(z_1) = p(z_1|x_1) b(x_1) + p(z_1|x_2) b(x_2).
Thus V_1(b | z_1) is given by V_1 evaluated at this updated belief.

Expected Value after Measuring
Since we do not know in advance what the next measurement will be, we have to compute the expectation over the possible measurements:
  E_z[ V_1(b|z) ] = Σ_z p(z) V_1(b|z)

Resulting Value Function
The four possible combinations (the two remaining linear components of V_1, for each of the two observations) yield the following function, which again can be simplified and pruned.
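A small sketch of the belief update and the expectation over measurements, using the measurement probabilities given on the slide.

```python
# Measurement model from the slide: p(z1|x1) = 0.7, p(z1|x2) = 0.3 (and complements for z2).
p_z_given_x = {("z1", "x1"): 0.7, ("z1", "x2"): 0.3,
               ("z2", "x1"): 0.3, ("z2", "x2"): 0.7}

def update_belief(p1, z):
    """Bayes' rule: b'(x1) = p(z|x1) b(x1) / p(z). Returns (posterior p1, p(z))."""
    pz = p_z_given_x[(z, "x1")] * p1 + p_z_given_x[(z, "x2")] * (1 - p1)
    return p_z_given_x[(z, "x1")] * p1 / pz, pz

def expected_value_after_measuring(p1, V1):
    """E_z[ V1(b|z) ] = sum_z p(z) V1(b|z), for a value function V1 over p1' = b'(x1)."""
    total = 0.0
    for z in ("z1", "z2"):
        p1_post, pz = update_belief(p1, z)
        total += pz * V1(p1_post)
    return total

print(update_belief(0.5, "z1"))   # (0.7, 0.5): z1 shifts the belief toward x1
```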

State Transitions (Prediction)
When the agent selects u_3, its state potentially changes. When computing the value function, we have to take these potential state changes into account, i.e. predict the belief forward:
  b'(x') = Σ_x p(x'|x,u_3) b(x)

Resulting Value Function after Executing u_3
Taking the state transitions into account as well, we finally obtain the value of sensing first and then acting.

Value Function for T=2
Taking into account that the agent can either directly perform u_1 or u_2, or first u_3 and then u_1 or u_2, we obtain V_2(b) (after pruning).

Graphical Representation of V_2(b)
(Diagram: u_1 is optimal in one region of belief space, u_2 in another; in the unclear region in between, the outcome of measuring is important.)

Deep Horizons and Pruning
We have now completed a full backup in belief space. This process can be applied recursively. (The slides plot the value functions for T=10 and T=20.)

Why Pruning is Essential
- Each update introduces additional linear components to V, and each measurement squares the number of linear components.
- Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions; at T=30 we have 10^561,012,337 linear functions.
- The pruned value function at T=20, in comparison, contains only 12 linear components.
- This combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.
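A sketch of pruning for the two-state case, where each linear component can be stored by its values at the two corners of belief space and is kept only if it attains the upper envelope somewhere. The example usage assumes a sensing payoff of -1 for u_3 in both states, which is not stated in the text; general POMDP pruning uses a linear program, which this sketch avoids by exploiting the one-dimensional belief space.

```python
def prune_1d(components, tol=1e-9):
    """Keep only linear components that attain the upper envelope for some p1 in [0, 1].

    Each component is a line v(p1) = (1 - p1) * a0 + p1 * a1, stored as (a0, a1).
    """
    def value(c, p):
        a0, a1 = c
        return (1 - p) * a0 + p * a1

    # candidate belief points: the corners plus all pairwise intersections
    points = {0.0, 1.0}
    for i, (a0, a1) in enumerate(components):
        for b0, b1 in components[i + 1:]:
            denom = (a1 - a0) - (b1 - b0)
            if abs(denom) > tol:
                p = (b0 - a0) / denom
                if 0.0 <= p <= 1.0:
                    points.add(p)

    return [c for c in components
            if any(value(c, p) >= max(value(d, p) for d in components) - tol
                   for p in points)]

# Components of V_1 as corner values (p1 = 0, p1 = 1): u1, u2, and an assumed u3 payoff of -1.
V1 = [(100, -100), (-50, 100), (-1, -1)]
print(prune_1d(V1))   # the sensing component never reaches the envelope and is pruned
```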

A Summary on POMDPs
- POMDPs compute the optimal action in partially observable, stochastic domains.
- For finite-horizon problems, the resulting value functions are piecewise linear and convex.
- In each iteration the number of linear constraints grows exponentially.
- POMDPs have so far only been applied successfully to very small state spaces with small numbers of possible observations and actions.