AM 121: Intro to Optimization Models and Methods: Fall 2018



AM 121: Intro to Optimization Models and Methods, Fall 2018
Lecture 18: Markov Decision Processes
Yiling Chen

Lesson Plan
- Markov decision processes
- Policies and value functions
- Solving: average reward, discounted reward; LPs; value and policy iteration
- Example applications: airline meals, assisted living, automated vehicles / helicopters, Super Mario

Reading: Hillier and Lieberman, Introduction to Operations Research, McGraw Hill, p903-99.

Markov decision process (MDP)
An MDP model contains:
- A set of possible world states S (m states)
- A set of possible actions A (n actions)
- A reward function R(s,a)
- A transition function P(s,a,s') ∈ [0,1]: the probability of transitioning to state s' given (s,a)
- Number of time periods: finite or infinite horizon
Problem: which action to take in each state?
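One minimal way to hold these ingredients in code, as a sketch (the container and field names here are illustrative, not from the lecture):

```python
from typing import Dict, List, NamedTuple, Tuple

State = int
Action = str

class MDP(NamedTuple):
    """A finite MDP: states S, actions A(s), rewards R(s,a), transitions P(s,a,s')."""
    states: List[State]
    actions: Dict[State, List[Action]]                            # actions available in each state
    rewards: Dict[Tuple[State, Action], float]                    # R(s, a)
    transitions: Dict[Tuple[State, Action], Dict[State, float]]   # P(s, a, .), each row sums to 1
```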

Markov property: the effect of an action depends only on the current state, not on prior history.

Example: Maintenance
Four machine states {0,1,2,3}. Actions: do nothing, overhaul, replace; transitions depend on the action. Each action in each state is associated with a cost (a negated reward), as follows:

State | Action   | Defective items cost | Maintenance cost | Lost production cost | Possible next states
0     | Nothing  | 0    | 0    | 0    | 1, 2, 3
1     | Nothing  | 1000 | 0    | 0    | 1, 2, 3
1     | Replace  | 0    | 4000 | 2000 | 0
2     | Nothing  | 3000 | 0    | 0    | 2, 3
2     | Overhaul | 0    | 2000 | 2000 | 1
2     | Replace  | 0    | 4000 | 2000 | 0
3     | Replace  | 0    | 4000 | 2000 | 0

(Hillier and Lieberman)

Maintenance Problem: state transition diagram
[Diagram: each arc is labeled with a reward R(s,a) and transition probabilities P(s,a,s'). Do nothing in state 0 ($0): to 1 w.p. 7/8, to 2 w.p. 1/16, to 3 w.p. 1/16. Do nothing in state 1 (-$1000): stay w.p. 3/4, to 2 w.p. 1/8, to 3 w.p. 1/8. Do nothing in state 2 (-$3000): stay w.p. 1/2, to 3 w.p. 1/2. Overhaul in state 2 (-$4000): to 1. Replace in states 1, 2, or 3 (-$6000): to 0.]

A Decision Policy
A (deterministic) policy µ: S → A is a mapping from state to action.
A stochastic policy µ: S → Δ(A) maps each state to a distribution over actions.
A stationary policy is invariant to time. Focus on stationary policies (and assume an infinite time horizon).
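Returning to the maintenance example, a sketch of its data in code (the dictionary names are illustrative; rewards are the negated total costs from the cost table, and the transition probabilities are those on the diagram):

```python
# States: 0 (good as new) ... 3 (inoperable). Actions available in each state.
ACTIONS = {
    0: ["nothing"],
    1: ["nothing", "replace"],
    2: ["nothing", "overhaul", "replace"],
    3: ["replace"],
}

# R(s, a): negated total cost (defective items + maintenance + lost production).
R = {
    (0, "nothing"): 0,
    (1, "nothing"): -1000,
    (1, "replace"): -6000,
    (2, "nothing"): -3000,
    (2, "overhaul"): -4000,
    (2, "replace"): -6000,
    (3, "replace"): -6000,
}

# P(s, a, s'), stored as {(s, a): {s': prob}}.
P = {
    (0, "nothing"): {1: 7/8, 2: 1/16, 3: 1/16},
    (1, "nothing"): {1: 3/4, 2: 1/8, 3: 1/8},
    (1, "replace"): {0: 1.0},
    (2, "nothing"): {2: 1/2, 3: 1/2},
    (2, "overhaul"): {1: 1.0},
    (2, "replace"): {0: 1.0},
    (3, "replace"): {0: 1.0},
}
```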

Following a Deterministic Policy
Determine the current state s. Given policy µ: S → A, execute action µ(s) ∈ A.
Fact: An MDP together with a policy µ defines a Markov chain.
Definition: An MDP is ergodic if the associated Markov chain is ergodic for every deterministic policy.
For an ergodic MDP, we can define π_µ(s') as the steady-state probability of being in state s' given µ.

Evaluating a Policy
How good is a policy µ from state s? Expected total reward? This may be infinite! How can we compare policies?
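A minimal sketch of inducing the Markov chain for a fixed policy and computing its steady-state distribution (function names are mine, and transitions are assumed to be in the {(s, a): {s': prob}} format of the earlier sketch):

```python
import numpy as np

def induced_chain(states, P, mu):
    """Transition matrix of the Markov chain obtained by always playing mu(s) in state s."""
    n = len(states)
    T = np.zeros((n, n))
    for i, s in enumerate(states):
        for s_next, prob in P[(s, mu[s])].items():
            T[i, states.index(s_next)] = prob
    return T

def steady_state(T):
    """Solve pi T = pi with sum(pi) = 1 (assumes the chain is ergodic)."""
    n = T.shape[0]
    A = np.vstack([T.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Example, with the maintenance data and the policy
# "do nothing in 0 and 1, overhaul in 2, replace in 3":
#   mu = {0: "nothing", 1: "nothing", 2: "overhaul", 3: "replace"}
#   steady_state(induced_chain([0, 1, 2, 3], P, mu))   # -> approximately [2/21, 5/7, 2/21, 2/21]
```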

Value Functions
Definition: A value function V_µ(s) provides a value for policy µ in each state s. There are different ways to define it. With a finite decision horizon, we can sum the rewards. With an infinite decision horizon, we can adopt two different criteria: the limit-average reward, or the expected discounted reward.

Limit-Average Reward
Let V_µ^(t)(s) denote the total expected reward of policy µ over the next t periods starting from state s.
Limit-average reward criterion: V_µ(s) = lim_{t→∞} V_µ^(t)(s) / t
For an ergodic MDP, V_µ(s) = Σ_{s'} R(s', µ(s')) π_µ(s'), where π_µ(s') is the steady-state probability of state s'.
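A small sketch of this formula, assuming a stationary distribution pi_mu has already been computed (for example with the steady_state sketch above):

```python
def limit_average_value(states, R, mu, pi_mu):
    """V_mu = sum over s' of R(s', mu(s')) * pi_mu(s'), for an ergodic MDP."""
    return sum(pi_mu[i] * R[(s, mu[s])] for i, s in enumerate(states))

# With the maintenance data and the policy above this gives
# (2/21)*0 + (5/7)*(-1000) + (2/21)*(-4000) + (2/21)*(-6000) = -35000/21, about -$1667 per period.
```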

Expected Discounted Reward
A reward t steps away is discounted by γ^t, for discount factor 0 < γ < 1. This models uncertainty about the time horizon, or that current rewards matter more than future ones.
Expected discounted reward criterion:
V_µ(s) = E_{s',s'',…}[ R(s, µ(s)) + γ R(s', µ(s')) + γ² R(s'', µ(s'')) + … ]

Solving MDPs
We can write an MDP problem as a linear program and solve the LP. The particular formulation depends on the objective criterion.

Limit-Average Reward
Given a model M = (S, A, P, R), find the policy µ that maximizes the limit-average reward. Assume the MDP is ergodic.

Possible Decision Variables

state \ action |    0     |    1     | … |    n-1
0              | x(0,0)   | x(0,1)   | … | x(0,n-1)
1              | x(1,0)   | x(1,1)   | … | x(1,n-1)
…              |          |          |   |
m-1            | x(m-1,0) | x(m-1,1) | … | x(m-1,n-1)

x(s,a) = 1 if action a is taken in state s, and 0 otherwise. Rows sum to 1. If the policy is stochastic, x(s,a) is the probability of taking action a in state s.

Better Decision Variables
Define π(s,a) as the steady-state probability of being in state s and taking action a. This defines a policy. To see this, note:
π(s) = Σ_{a'} π(s,a')
π(s,a) = π(s) x(s,a)   (probability of s times probability of a given s)
So the policy is recovered as: x(s,a) = π(s,a) / Σ_{a'} π(s,a')

Balance Equation
π(s,a): steady-state probability of being in state s and taking action a.
Balance equation: π(s') = Σ_a π(s',a) = Σ_s π(in s, go to s') = Σ_s Σ_a π(s,a) P(s,a,s')
This implies consistency in the π values across (s,a) pairs.

LP Formulation (Limit-Average)
V = max Σ_s Σ_a R(s,a) π(s,a)                           (1)
s.t.  Σ_s Σ_a π(s,a) = 1                                (2)
      Σ_a π(s',a) = Σ_s Σ_a π(s,a) P(s,a,s')   ∀ s'     (3)
      π(s,a) ≥ 0   ∀ s, a                               (4)

(1) Maximize the expected average reward. (2) Total probability sums to one. (3) Balance equation.
Note: it is important here that the MDP be ergodic.

Optimal Policies are Deterministic
(Consider the limit-average LP (1)-(4) above.)
There are nm decision variables and m+1 equality constraints (one of the balance equalities is redundant), so simplex terminates with m basic variables. Since the MDP is ergodic, π(s) > 0 for each s, and so π(s,a) > 0 for at least one a in each state s.
⇒ π(s,a) > 0 for exactly one a in each state s ⇒ a deterministic policy.

Maintenance Problem: state transition diagram (as before), with rewards R(s,a) and transition probabilities P(s,a,s').

Example: Maintenance
min 1000π(1,1) + 6000π(1,3) + 3000π(2,1) + 4000π(2,2) + 6000π(2,3) + 6000π(3,3)
(minimization, because these are costs rather than rewards)
s.t.
π(0,1) + π(1,1) + π(1,3) + … + π(3,3) = 1
π(0,1) − (π(1,3) + π(2,3) + π(3,3)) = 0
π(1,1) + π(1,3) − (7/8 π(0,1) + 3/4 π(1,1) + π(2,2)) = 0
π(2,1) + π(2,2) + π(2,3) − (1/16 π(0,1) + 1/8 π(1,1) + 1/2 π(2,1)) = 0
π(3,3) − (1/16 π(0,1) + 1/8 π(1,1) + 1/2 π(2,1)) = 0
π(0,1), …, π(3,3) ≥ 0

Solution: π*(0,1) = 2/21; π*(1,1) = 5/7; π*(2,2) = 2/21; π*(3,3) = 2/21; the rest are zero.
Optimal policy: µ(0)=1, µ(1)=1, µ(2)=2, µ(3)=3; do nothing in states 0 and 1, overhaul in 2, and replace in 3.

Markov Chain of the Optimal Policy
[Diagram: the chain induced by the optimal policy, with transitions 0→1 (7/8), 0→2 (1/16), 0→3 (1/16), 1→1 (3/4), 1→2 (1/8), 1→3 (1/8), 2→1 (overhaul), 3→0 (replace).]
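A sketch of solving this LP numerically with scipy.optimize.linprog (variable ordering and names are mine; since the coefficients are costs, the LP is a direct minimization):

```python
import numpy as np
from scipy.optimize import linprog

# Decision variables in a fixed order: pi(s, a) for each allowed (state, action) pair.
pairs = [(0, "nothing"), (1, "nothing"), (1, "replace"),
         (2, "nothing"), (2, "overhaul"), (2, "replace"), (3, "replace")]
cost = [0, 1000, 6000, 3000, 4000, 6000, 6000]            # c(s, a), to be minimized
P = {(0, "nothing"): {1: 7/8, 2: 1/16, 3: 1/16},
     (1, "nothing"): {1: 3/4, 2: 1/8, 3: 1/8},
     (1, "replace"): {0: 1.0},
     (2, "nothing"): {2: 1/2, 3: 1/2},
     (2, "overhaul"): {1: 1.0},
     (2, "replace"): {0: 1.0},
     (3, "replace"): {0: 1.0}}
states = [0, 1, 2, 3]

A_eq = [np.ones(len(pairs))]                              # (2): total probability = 1
b_eq = [1.0]
for s_next in states:                                     # (3): balance equation for each s'
    row = np.zeros(len(pairs))
    for j, (s, a) in enumerate(pairs):
        if s == s_next:
            row[j] += 1.0                                 # sum_a pi(s', a)
        row[j] -= P[(s, a)].get(s_next, 0.0)              # - sum_{s,a} pi(s, a) P(s, a, s')
    A_eq.append(row)
    b_eq.append(0.0)

res = linprog(c=cost, A_eq=np.vstack(A_eq), b_eq=b_eq)    # pi >= 0 is the default bound
for (s, a), val in zip(pairs, res.x):
    if val > 1e-9:
        print(f"pi({s},{a}) = {val:.4f}")                 # 2/21, 5/7, 2/21, 2/21, as on the slide
```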

Expected Discounted Reward Criterion
Given a model M = (S, A, P, R), find the policy µ that maximizes the expected discounted reward, for discount factor 0 < γ < 1. There is no need for the MDP to be ergodic.

LP Formulation: Discounted Criterion (starting point: the limit-average LP)
V = max Σ_s Σ_a R(s,a) π(s,a)                               (1)
s.t.  Σ_s Σ_a π(s,a) = 1                                    (2)
      Σ_a π(s',a) − Σ_s Σ_a π(s,a) P(s,a,s') = 0   ∀ s'     (3)
      π(s,a) ≥ 0   ∀ s, a                                   (4)

LP Formulation: Discounted Criterion
V = max Σ_s Σ_a R(s,a) π(s,a)                                     (1)
s.t.  Σ_a π(s',a) − γ Σ_s Σ_a π(s,a) P(s,a,s') = β(s')   ∀ s'     (3')
      π(s,a) ≥ 0   ∀ s, a                                         (4)

for any β values such that Σ_s β(s) = 1 and β(s) > 0 for all s. (The normalization constraint (2) is dropped; summing (3') over s' gives Σ_s Σ_a π(s,a) = 1/(1−γ).)
Policy: x(s,a) = π(s,a) / Σ_{a'} π(s,a')
Interpretation: π(s,a) is the discounted expected number of time periods of being in state s and taking action a; β(s) is the initial state distribution.

Understanding this formulation: it is the dual of another, easier to understand formulation. See this in a few slides!
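A sketch of the discounted formulation as a reusable routine in the same dictionary format (names are mine; linprog minimizes, so the reward objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

def solve_discounted_lp(states, R, P, gamma, beta):
    """max sum R(s,a) pi(s,a)  s.t.  sum_a pi(s',a) - gamma * sum_{s,a} pi(s,a) P(s,a,s') = beta(s')."""
    pairs = sorted(R.keys())
    c = [-R[p] for p in pairs]                 # maximize reward <=> minimize its negation
    A_eq, b_eq = [], []
    for s_next in states:
        row = np.zeros(len(pairs))
        for j, (s, a) in enumerate(pairs):
            if s == s_next:
                row[j] += 1.0                  # sum_a pi(s', a)
            row[j] -= gamma * P[(s, a)].get(s_next, 0.0)
        A_eq.append(row)
        b_eq.append(beta[s_next])
    res = linprog(c=c, A_eq=np.vstack(A_eq), b_eq=b_eq)   # pi >= 0 by default
    pi = dict(zip(pairs, res.x))
    # Recover x(s,a): for a deterministic optimum, all mass in state s sits on one action.
    policy = {s: max((a for (t, a) in pairs if t == s), key=lambda a: pi[(s, a)])
              for s in states}
    return pi, policy
```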

Example: Maintenance (Discounted)
Discount γ = 0.9; arbitrarily set β = (1/4, 1/4, 1/4, 1/4).
min 1000π(1,1) + 6000π(1,3) + 3000π(2,1) + 4000π(2,2) + 6000π(2,3) + 6000π(3,3)
s.t.
π(0,1) − 0.9(π(1,3) + π(2,3) + π(3,3)) = 1/4
π(1,1) + π(1,3) − 0.9(7/8 π(0,1) + 3/4 π(1,1) + π(2,2)) = 1/4
π(2,1) + π(2,2) + π(2,3) − 0.9(1/16 π(0,1) + 1/8 π(1,1) + 1/2 π(2,1)) = 1/4
π(3,3) − 0.9(1/16 π(0,1) + 1/8 π(1,1) + 1/2 π(2,1)) = 1/4
π(0,1), …, π(3,3) ≥ 0

Optimal solution: π*(0,1) = 1.21, π*(1,1) = 6.66, π*(2,2) = 1.07, π*(3,3) = 1.07, the rest all zero. (⇒ the same policy as for the limit-average criterion.)

Fundamental Theorem of MDPs
Theorem: Under the discounted reward criterion, there exists a policy µ* that is uniformly optimal; i.e., it is optimal for all distributions over start states.
Note 1: there need not be a unique optimal policy.
Note 2: The uniform optimality property is only true under the limit-average reward criterion for an ergodic MDP.

Policy and Value Iteration
Another way to solve MDPs under the expected discounted reward criterion.

The Bellman Equations
These define consistency requirements for the case of a discounted reward criterion.
For any policy µ, the value function satisfies:
V_µ(s) = R(s, µ(s)) + γ Σ_{s'∈S} P(s, µ(s), s') V_µ(s')   ∀ s
For the optimal policy:
V*(s) = max_{a∈A} [ R(s,a) + γ Σ_{s'∈S} P(s,a,s') V*(s') ]   ∀ s
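For a fixed policy the first set of Bellman equations is linear in the values V_µ(s), so policy evaluation can be done with one linear solve; a sketch (function and argument names are mine):

```python
import numpy as np

def evaluate_policy(states, R, P, mu, gamma):
    """Solve V = R_mu + gamma * P_mu V, i.e. (I - gamma * P_mu) V = R_mu."""
    n = len(states)
    P_mu = np.zeros((n, n))
    R_mu = np.zeros(n)
    for i, s in enumerate(states):
        R_mu[i] = R[(s, mu[s])]
        for s_next, prob in P[(s, mu[s])].items():
            P_mu[i, states.index(s_next)] = prob
    return np.linalg.solve(np.eye(n) - gamma * P_mu, R_mu)
```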

Policy Iteration
Improve a policy: µ'(s) := argmax_a [ R(s,a) + γ Σ_{s'} P(s,a,s') V_µ(s') ]
Evaluate a policy: V_µ(s) = R(s, µ(s)) + γ Σ_{s'∈S} P(s, µ(s), s') V_µ(s')
Policy iteration alternates evaluation (E) and improvement (I) steps:
µ_0 →E V_µ0 →I µ_1 →E V_µ1 →I µ_2 →E V_µ2 →I …
Each step provides a strict improvement in the policy. There are finitely many policies. Converges!

Value Iteration
Just iterate on the Bellman equations (for the kth step):
V_{k+1}(s) := max_a [ R(s,a) + γ Σ_{s'∈S} P(s,a,s') V_k(s') ]
Also converges. Each iteration is less computationally intensive than in policy iteration.
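A sketch of value iteration in the same data format (the tolerance and names are mine); policy iteration would alternate the linear-solve evaluation above with the greedy improvement step at the end of this sketch:

```python
def value_iteration(states, actions, R, P, gamma, tol=1e-8):
    """Iterate V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s,a,s') V_k(s') ]."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[(s, a)] + gamma * sum(p * V[t] for t, p in P[(s, a)].items())
                        for a in actions[s])
                 for s in states}
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged V (the improvement step of policy iteration).
    mu = {s: max(actions[s],
                 key=lambda a, s=s: R[(s, a)] + gamma * sum(p * V[t] for t, p in P[(s, a)].items()))
          for s in states}
    return V, mu
```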

Understanding the Discounted Reward LP Formulation
The π-based LP above is the dual of another, easier to understand formulation: choose values V(s) to
minimize Σ_s β(s) V(s)
subject to V(s) ≥ R(s,a) + γ Σ_{s'} P(s,a,s') V(s')   ∀ s, a
For positive β values, this value-based LP is equivalent to the Bellman optimality conditions, and its dual is the earlier LP in the π(s,a) variables.
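A sketch of this value-based LP with linprog, under the same data format as before (names are mine):

```python
import numpy as np
from scipy.optimize import linprog

def solve_value_lp(states, R, P, gamma, beta):
    """min sum_s beta(s) V(s)  s.t.  V(s) >= R(s,a) + gamma * sum_{s'} P(s,a,s') V(s')  for all (s,a)."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    A_ub, b_ub = [], []
    for (s, a), r in R.items():
        row = np.zeros(n)
        row[idx[s]] -= 1.0                          # -V(s)
        for s_next, p in P[(s, a)].items():
            row[idx[s_next]] += gamma * p           # + gamma * sum_{s'} P(s,a,s') V(s')
        A_ub.append(row)                            # row @ V <= -R(s,a)
        b_ub.append(-r)
    c = [beta[s] for s in states]
    res = linprog(c=c, A_ub=np.vstack(A_ub), b_ub=b_ub, bounds=[(None, None)] * n)
    return dict(zip(states, res.x))                 # the optimal values V*(s)
```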

Applications
Airline meal provisioning, assisted living, automated vehicles.

Airline meal provisioning (Goto 00): determine the quantity of meals to load; passenger load varies because of stand-by passengers and missed flights. Average costs down 17%, short-catered flights down 33%.

Additional applications
- Assisted living (Pollack 05)
- High-speed obstacle avoidance (Michels et al. 05)
- Super Mario World

Summary: MDPs
- A policy maps states to actions; fixing a policy, we have a Markov chain.
- Can adopt the limit-average reward and expected discounted reward criteria.
- Can solve for optimal policies via LPs. Solutions are deterministic and uniformly optimal (need ergodicity for the limit-average reward criterion).
- Can also use policy and value iteration.