Markov Decision Processes (and a small amount of reinforcement learning)


Markov Decision Processes (and a small amount of reinforcement learning). Slides adapted from: Brian Williams, MIT; Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU. Nicholas Roy, 16.410/413 Session 23.

How Should a Rover Search for its Landing Craft? State space search? As a constraint satisfaction problem? Goal-directed planning? Linear programming? Is the real world well-behaved?

How Should a Rover Search for its Landing Craft? What if each action can have one of a set of different outcomes? What if the outcomes occur probabilistically?

Ideas in this lecture: The problem is to accumulate rewards, rather than to achieve goal states. The approach is to generate reactive policies for how to act in all situations, rather than plans for a single starting situation. Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state. Value functions are iteratively approximated.

MDP Problem: [Figure: agent-environment loop; the agent observes state s_t, receives reward r_t, and issues action a_t to the environment.] Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

Markov Decision Processes (MDPs)

Model:
- Finite set of states, S
- Finite set of actions, A
- (Probabilistic) state transitions, T(s_i, a_j, s_k)
- Reward for each state and action, R(s_i, a_i)

Process: observe state s_t in S; choose action a_t in A; receive immediate reward r_t; the state changes to some s_{t+1} according to T(s_t, a_t, s_{t+1}). [Figure: example transition graph; legal transitions shown, rewards on unlabeled transitions are 0.]

MDP Environment Assumptions

Markov assumption: the next state and reward are a function only of the current state and action:
p(s_{t+1} | a_t, s_t, a_{t-1}, s_{t-1}, a_{t-2}, ...) = p(s_{t+1} | a_t, s_t)
r(s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, ...) = r(s_t, a_t)

Uncertain and unknown environment: p(s_{t+1} | a_t, s_t) and r may be nondeterministic and unknown.
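To make the model concrete, here is a minimal sketch (not from the slides) of one way to encode the tuple (S, A, T, R) as plain Python dictionaries; the three-state example and all of its numbers are made up for illustration.

```python
# Hypothetical encoding of an MDP (S, A, T, R) as plain dictionaries.
# T[s][a] maps each successor state s2 to p(s2 | s, a); R[s][a] is the immediate reward.

S = ["s1", "s2", "s3"]
A = ["a0", "a1"]

T = {
    "s1": {"a0": {"s1": 0.2, "s2": 0.8}, "a1": {"s3": 1.0}},
    "s2": {"a0": {"s2": 1.0},            "a1": {"s1": 0.5, "s3": 0.5}},
    "s3": {"a0": {"s3": 1.0},            "a1": {"s3": 1.0}},
}

R = {
    "s1": {"a0": 0.0, "a1": 0.0},
    "s2": {"a0": 1.0, "a1": 0.0},
    "s3": {"a0": 0.0, "a1": 0.0},
}

# Sanity check: outgoing transition probabilities sum to 1 for every (s, a).
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```

Later sketches in these notes reuse this dictionary layout.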

So what is the solution to an MDP? An MDP solution is a policy π : S → A, which selects an action for each state. The optimal policy π* : S → A selects, for each state, the action that maximizes lifetime reward.

[Figure: two example policies π drawn on a grid world, assuming a deterministic world.] There are many policies, and not all are necessarily optimal. There may be several optimal policies.

What is this lifetime reward? The optimal policy maximizes the expected reward of the agent over the agent's lifetime:

π*(s) = argmax_{a ∈ A} E_{s_0, s_1, ...} [ Σ_{t=0}^{∞} r(s_t, a_t) ]

How long will the agent live?
- Finite horizon: rewards accumulate for a fixed period. $1K + $1K + $1K = $3K
- Infinite horizon: assume reward accumulates forever. $1K + $1K + ... = infinity
- Discounting: future rewards are not worth as much (a bird in hand). Introduce a discount factor γ: $1K + γ $1K + γ^2 $1K + ... converges.

Value Function V^π for a Given Policy π

V^π(s_t) is the accumulated lifetime reward resulting from starting in state s_t and repeatedly executing policy π:

V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_i γ^i r_{t+i}

where r_t, r_{t+1}, r_{t+2}, ... are generated by following π starting at s_t. [Figure: V^π for an example grid world, assuming γ = 0.9.]
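As a quick numerical check of the discounted-sum definition, here is a small sketch with a made-up reward sequence and γ = 0.9:

```python
# Discounted lifetime reward for a hypothetical reward sequence (gamma = 0.9).
gamma = 0.9
rewards = [100, 0, 0, 100, 0]   # r_t, r_{t+1}, ... (illustrative numbers only)
v = sum((gamma ** i) * r for i, r in enumerate(rewards))
print(v)                        # 100 + 0.9**3 * 100 = 172.9
```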

An Optimal Policy π* Given the Value Function V*

Notice: suppose that, given state s, we knew the lifetime rewards of all other states. Then:
1. Examine all possible actions a_i in state s.
2. Select the action a_i with the greatest lifetime reward (see the code sketch after this slide).

The lifetime reward Q(s, a_i) combines: the immediate reward for taking the action, r(s, a); the probability of each posterior state s', p(s' | s, a); and the lifetime reward starting in the target state, V(s'), discounted by γ:

π*(s) = argmax_a [ r(s, a) + γ Σ_{s'} V(s') p(s' | s, a) ]

Must know: the value function and the environment model, p : S × A × S → [0, 1] and r : S × A → R.

Value Function V* for an Optimal Policy π*

[Figure: example grid with two reward states, S_A with reward R_A and S_B with reward R_B.]

Optimal value function for a one-step horizon: V*_1(s) = max_{a_i} [ r(s, a_i) ]
Optimal value function for a two-step horizon: V*_2(s) = max_{a_i} [ r(s, a_i) + γ Σ_{s'} V*_1(s') p(s' | s, a_i) ]
Optimal value function for an n-step horizon: V*_n(s) = max_{a_i} [ r(s, a_i) + γ Σ_{s'} V*_{n-1}(s') p(s' | s, a_i) ]
Optimal value function for an infinite horizon: V*(s) = max_{a_i} [ r(s, a_i) + γ Σ_{s'} V*(s') p(s' | s, a_i) ]
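The "examine every action and pick the one with the greatest lifetime reward" rule is just a one-step lookahead. A sketch, assuming the hypothetical dictionary encoding of T and R from the earlier snippet and a value table V (a dict from state to value):

```python
def greedy_action(s, V, T, R, gamma=0.9):
    """One-step lookahead: argmax_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]."""
    def q(a):
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
    # T[s] is keyed by the actions available in s.
    return max(T[s], key=q)
```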

Solving MDPs by Value Iteration

Insight: calculate optimal values iteratively using dynamic programming.

Algorithm (sketched in code below):
1. Label all states: for each state s, V_0(s) ← max_a r(s, a)
2. Iteratively calculate values using Bellman's equation: for each state s, V_{t+1}(s) ← max_a [ r(s, a) + γ Σ_{s'} V_t(s') p(s' | s, a) ]
3. Terminate when values are close enough: |V_{t+1}(s) - V_t(s)| < ε
4. Return V* = V_{t+1}

Policy execution: the agent selects the optimal action by one-step lookahead on V:
π(s) = argmax_a [ r(s, a) + γ Σ_{s'} V(s') p(s' | s, a) ]

Example of Value Iteration: V_{t+1}(s) ← max_a [ r(s, a) + γ Σ_{s'} V_t(s') p(s' | s, a) ], with γ = 0.9 and p(s' | s, a) deterministic. [Figure: grid world showing V_t and the updated values V_{t+1}.]
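A compact sketch of the value-iteration loop above, again assuming the hypothetical dictionary encoding of T and R (function name and defaults are illustrative):

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    # Step 1: initialize V(s) to the best immediate reward.
    V = {s: max(R[s][a] for a in A) for s in S}
    while True:
        # Step 2: Bellman backup for every state.
        V_new = {
            s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                   for a in A)
            for s in S
        }
        # Step 3: terminate when values are close enough.
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new
```

The policy is then recovered by the one-step lookahead shown earlier (greedy_action) on the returned value table.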

Example of Value Iteration: V_{t+1}(s) ← max_a [ r(s, a) + γ Σ_{s'} V_t(s') p(s' | s, a) ], with γ = 0.9 and p(s' | s, a) non-deterministic (red arcs occur with 50% probability). [Figure: grid world showing the successive value functions V_1, V_2, V_3.]

Convergence of Value Iteration

If we terminate when values are close enough, |V_{t+1}(s) - V_t(s)| < ε, then max_{s ∈ S} |V_{t+1}(s) - V*(s)| < 2εγ / (1 - γ). Value iteration converges in polynomial time, and convergence is guaranteed even if updates are performed infinitely often, but asynchronously and in any order.

Ideas in this lecture: The objective is to accumulate rewards, rather than to achieve goal states; objectives are achieved along the way, rather than at the end. Policies can be described by value functions, which describe the greatest lifetime reward achievable at every state. Value iteration is a fast algorithm for computing the value function under certain assumptions.

Appendix: Policy Iteration

Idea: iteratively improve the policy (sketched in code below).
1. Policy evaluation: given a policy π_i, calculate V_i = V^{π_i}, the utility of each state if π_i were to be executed.
2. Policy improvement: calculate a new maximum-expected-utility policy π_{i+1} using one-step lookahead based on V_i.

π_i improves at every step, converging when π_{i+1} = π_i. Computing V_i is simpler than in value iteration (no max):
V_{t+1}(s) ← r(s, π_i(s)) + γ Σ_{s'} V_t(s') p(s' | s, π_i(s))
Either solve the linear equations in O(N^3), or solve iteratively, similar to value iteration.
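A sketch of policy iteration under the same assumed dictionary encoding; it uses the iterative evaluation variant rather than the O(N^3) linear solve, and the fixed number of evaluation sweeps is an arbitrary illustrative choice:

```python
def policy_iteration(S, A, T, R, gamma=0.9, eval_sweeps=100):
    pi = {s: A[0] for s in S}                      # arbitrary initial policy
    while True:
        # Policy evaluation: approximate V^pi by repeated backups (no max over actions).
        V = {s: 0.0 for s in S}
        for _ in range(eval_sweeps):
            V = {s: R[s][pi[s]] + gamma * sum(p * V[s2]
                     for s2, p in T[s][pi[s]].items())
                 for s in S}
        # Policy improvement: one-step lookahead on V.
        pi_new = {s: max(A, key=lambda a: R[s][a] + gamma *
                         sum(p * V[s2] for s2, p in T[s][a].items()))
                  for s in S}
        if pi_new == pi:                           # converged: pi_{i+1} == pi_i
            return pi, V
        pi = pi_new
```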

Reinforcement Learning Problem

Given: repeatedly executed actions, observed states, and observed rewards. Learn an action policy π : S → A that maximizes the lifetime reward r_0 + γ r_1 + γ^2 r_2 + ... from any start state, with discount 0 < γ < 1.

Note: unsupervised learning, delayed reward, model not known. [Figure: agent-environment loop, as before.] Goal: learn to choose actions that maximize the lifetime reward r_0 + γ r_1 + γ^2 r_2 + ...

How About Learning the Model? Certainty Equivalence (sketched in code below)
1. Explore the world.
2. Count how often reward r occurs in state s_i: SumR[s_i] += r, Count[s_i] += 1.
3. Count how often state s_j follows state s_i: Trans[s_i, s_j] += 1.
4. At any time: r_est(s_i) = SumR[s_i] / Count[s_i] and T_est(s_i, s_j) = Trans[s_i, s_j] / Count[s_i].
5. So at any time we can solve for V_est.
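A sketch of the certainty-equivalence bookkeeping; the counter names follow the slide, while the function names and the way experience is fed in are assumptions:

```python
from collections import defaultdict

SumR  = defaultdict(float)   # total reward observed in each state
Count = defaultdict(int)     # number of visits to each state
Trans = defaultdict(int)     # counts of observed (s, s_next) transitions

def record(s, r, s_next):
    """Update the counters after observing reward r in s and a transition to s_next."""
    SumR[s]  += r
    Count[s] += 1
    Trans[(s, s_next)] += 1

def r_est(s):
    return SumR[s] / Count[s] if Count[s] else 0.0

def T_est(s, s_next):
    return Trans[(s, s_next)] / Count[s] if Count[s] else 0.0
```

With r_est and T_est in hand, V_est can be computed by value iteration (or a linear solve) exactly as in the planning setting.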

Certainty Equivalence Costs
- Memory: O(N^2)
- Time to update counters: O(1)
- Time to re-evaluate V: O(N^3) using matrix inversion; O(N^2 k_CRIT) using value iteration, where k_CRIT iterations are needed to converge; O(N k_CRIT) using value iteration when T is sparse (i.e., the mean number of successors is constant).

Too expensive for some people. Prioritized sweeping will help (see later), but first let's review a very inexpensive approach.

Eliminating the Model with Q Functions

π*(s) = argmax_a [ r(s, a) + γ Σ_{s'} V*(s') p(s' | s, a) ]

Key idea: define a function like V that encapsulates δ and r:
Q(s, a) = r(s, a) + γ Σ_{s'} V*(s') p(s' | s, a)
Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
π*(s) = argmax_a Q(s, a)

How Do We Learn Q?

Q(s_t, a_t) = r(s_t, a_t) + γ Σ_{s'} V*(s') p(s' | s_t, a_t)

We need to eliminate V* from the update rule. Note that Q and V* are closely related: V*(s) = max_{a'} Q(s, a'). Substituting Q for V*:

Q(s_t, a_t) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Example (γ = 0.9): [Figure: grid world with reward R at the goal; the actions out of state s_2 have Q estimates 63, 81, and 100, and the s_1 → s_2 estimate is updated from 72 to 90.] Learning update for moving right from s_1 to s_2:
Q(s_1, a_right) ← r(s_1, a_right) + γ max_{a'} Q(s_2, a') = 0 + 0.9 × max{63, 81, 100} = 90

Note: if rewards are non-negative, then for all s, a, n: Q_n(s, a) ≤ Q_{n+1}(s, a) and 0 ≤ Q_n(s, a) ≤ Q(s, a), where Q_n denotes the approximation after n updates.

Q-Learning Iterations

The agent starts at the top-left corner and moves clockwise around the perimeter; initially all values in the Q table are zero; γ = 0.8; update rule Q(s, a) ← r + γ max_{a'} Q(s', a'). [Figure: 2 × 3 grid world with states s1 s2 s3 on the top row and s6 s5 s4 on the bottom row.]

Q(s4, W) ← r + γ max_{a'} {Q(s5, loop)} = 10 + 0.8 × 0 = 10
Q(s3, S) ← r + γ max_{a'} {Q(s4, W), Q(s4, N)} = 0 + 0.8 × max{0, 10} = 8
Q(s2, E) ← r + γ max_{a'} {Q(s3, W), Q(s3, S)} = 0 + 0.8 × max{0, 8} = 6.4
Q(s1, E) ← ...

Crib Sheet: Q-Learning for Deterministic Worlds

Let Q̂ denote the current approximation to Q. Initially, for each s, a, initialize the table entry Q̂(s, a) ← 0. Observe the current state s, then do forever (see the sketch below):
- Select an action a and execute it.
- Receive the immediate reward r.
- Observe the new state s'.
- Update the table entry: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
- s ← s'
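The crib sheet translates nearly line-for-line into code. A sketch for a deterministic world, where `env_step` is a hypothetical environment function returning the immediate reward and next state, and uniform random action selection stands in for whatever exploration policy is used:

```python
import random
from collections import defaultdict

def q_learning(actions, env_step, start, gamma=0.8, episodes=1000, steps=50):
    Q = defaultdict(float)                      # Q_hat(s, a), initially 0
    for _ in range(episodes):
        s = start
        for _ in range(steps):
            a = random.choice(actions)          # exploration policy (here: uniform random)
            r, s_next = env_step(s, a)          # execute a, observe r and s'
            # Deterministic-world update: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```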

Discussion
- How should the learning agent use the intermediate Q values? Exploration vs. exploitation.
- Scaling up in the size of the state space: function approximators (a neural net instead of a table), reuse, use of macros.

Nondeterministic Case

We redefine V and Q by taking expected values:
V^π(s_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ] = E[ Σ_i γ^i r_{t+i} ]
Q(s_t, a_t) = E[ r(s_t, a_t) + γ V*(δ(s_t, a_t)) ]

Nondeterministic Case

Alter the training rule to
Q_n(s, a) ← (1 - α_n) Q_{n-1}(s, a) + α_n [ r + γ max_{a'} Q_{n-1}(s', a') ]
where α_n = 1 / (1 + visits_n(s, a)) and s' = δ(s, a) (see the sketch below). Convergence of Q can still be proven [Watkins and Dayan, 1992].

Ongoing Research
- Handling the case where the state is only partially observable
- Designing optimal exploration strategies
- Extending to continuous actions and states
- Learning and using the model δ : S × A → S
- Relationship to dynamic programming
- Multiple learners: multi-agent reinforcement learning
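For the nondeterministic training rule, only the update line and a visit counter change relative to the deterministic sketch; here is that single step in isolation (all names hypothetical):

```python
from collections import defaultdict

Q      = defaultdict(float)   # current estimate Q_hat_n(s, a)
visits = defaultdict(int)     # visits_n(s, a)

def td_update(s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [ r + gamma * max_a' Q_{n-1}(s',a') ]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])          # alpha_n = 1 / (1 + visits_n(s, a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```

The decaying step size α_n averages over the stochastic outcomes of (s, a) instead of overwriting the entry, which is what allows the convergence proof to go through.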