Markov decision processes and interval Markov chains: exploiting the connection

Markov decision processes and interval Markov chains: exploiting the connection
Mingmei Teo
Supervisors: Prof. Nigel Bean, Dr Joshua Ross
University of Adelaide, July 10, 2013

Intervals and interval arithmetic. We use the notation X = [X̲, X̄] to represent an interval, where X̲ and X̄ are its lower and upper bounds. Interval arithmetic allows us to perform arithmetic operations on intervals: X ∘ Y = {x ∘ y : x ∈ X, y ∈ Y}, where X and Y are intervals and ∘ is the arithmetic operator.

Intervals and interval arithmetic. Let X = [−1, 1]. Then we have X² = {x² : x ∈ [−1, 1]} = [0, 1], whilst X · X = {x₁ · x₂ : x₁ ∈ [−1, 1], x₂ ∈ [−1, 1]} = [−1, 1]. So here we have the idea of one-sample (the same realisation of X is used in every occurrence) and re-sample (each occurrence of X is sampled independently).
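To make the distinction concrete, here is a minimal Python sketch (the helper names are ours, chosen for illustration; this is not INTLAB) that computes the two sets exactly from the interval endpoints.

```python
def interval_mul(x, y):
    """Re-sample product X * Y: the two factors vary independently, so the
    exact bounds are attained at products of the endpoints."""
    products = [a * b for a in x for b in y]
    return (min(products), max(products))

def interval_square(x):
    """One-sample square X**2: the same realisation is used for both factors,
    so the result can never be negative."""
    lo, hi = x
    if lo <= 0.0 <= hi:
        return (0.0, max(lo * lo, hi * hi))
    return (min(lo * lo, hi * hi), max(lo * lo, hi * hi))

X = (-1.0, 1.0)
print(interval_square(X))   # (0.0, 1.0)  -- one-sample
print(interval_mul(X, X))   # (-1.0, 1.0) -- re-sample
```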

Computation with interval arithmetic. Computational software exists, e.g. INTLAB, which performs arithmetic operations on interval vectors and matrices and solves systems of linear equations with intervals.

Why might interval arithmetic be useful? The usual approach is a point estimate of the parameters together with a sensitivity analysis. Can we avoid the need for sensitivity analysis? Is it possible to directly incorporate the uncertainty of parameter values into our model? Intervals can be used to bound our parameter values: [x − error, x + error].

Markov chains + intervals = ? Consider a discrete time Markov chain with n + 1 states, {0, ..., n}, and state 0 an absorbing state. The interval transition probability matrix is

P =
  [1, 1]          [0, 0]  ...  [0, 0]
  [P̲_10, P̄_10]
  ...                      P_s
  [P̲_n0, P̄_n0]

where the first row reflects that state 0 is absorbing, the first column contains the interval probabilities of moving to state 0, and P_s is the n × n block of intervals for transitions among states 1, ..., n.

Conditions on the interval transition probability matrix. Bounds are valid probabilities: 0 ≤ P̲_ij ≤ P̄_ij ≤ 1. Row sums must satisfy Σ_j P̲_ij ≤ 1 ≤ Σ_j P̄_ij.
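A small Python sketch of these two conditions (the function name is ours, for illustration only), applied to the interval matrix that appears in the counter-example at the end of the talk:

```python
def is_valid_interval_matrix(lower, upper):
    """Check that every entry satisfies 0 <= lower <= upper <= 1 and that each
    row satisfies sum(lower row) <= 1 <= sum(upper row)."""
    for lo_row, up_row in zip(lower, upper):
        if not all(0.0 <= lo <= up <= 1.0 for lo, up in zip(lo_row, up_row)):
            return False
        if not (sum(lo_row) <= 1.0 <= sum(up_row)):
            return False
    return True

# Interval bounds taken from the counter-example matrix later in the talk.
lower = [[1.0, 0.0, 0.0, 0.0],
         [0.3, 0.0, 0.0, 0.0],
         [0.2, 0.0, 0.0, 0.0],
         [0.1, 0.0, 0.0, 0.0]]
upper = [[1.0, 0.0, 0.0, 0.0],
         [0.35, 1.0, 0.0, 0.1],
         [0.3, 1.0, 1.0, 1.0],
         [0.2, 1.0, 0.3, 0.0]]
print(is_valid_interval_matrix(lower, upper))   # True
```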

Time homogeneity. Standard Markov chains: the one-step transition probability matrix, P, is constant over time. Interval Markov chains: the interval matrix itself may be time inhomogeneous or time homogeneous, and for a time homogeneous interval matrix the realisations may be chosen one-sample (giving a time homogeneous Markov chain) or re-sample (giving a time inhomogeneous Markov chain); a simulation sketch of this distinction follows.
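The following Python sketch illustrates the difference for chains. The two point matrices are hypothetical valid realisations of some interval matrix (invented for illustration); one-sample fixes a single realisation for the whole trajectory, while re-sample draws a fresh realisation at every step.

```python
import random

# Two hypothetical realisations of an interval transition matrix
# (state 0 is absorbing; each row sums to 1).
REALISATIONS = [
    [[1.0, 0.0, 0.0], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]],
    [[1.0, 0.0, 0.0], [0.35, 0.45, 0.2], [0.25, 0.25, 0.5]],
]

def steps_to_absorption(start, one_sample, rng=random):
    """Simulate the number of steps until state 0 is hit."""
    P = rng.choice(REALISATIONS)          # one-sample: fixed for the whole run
    state, steps = start, 0
    while state != 0:
        if not one_sample:
            P = rng.choice(REALISATIONS)  # re-sample: redrawn at every step
        r, acc = rng.random(), 0.0
        for j, p in enumerate(P[state]):
            acc += p
            if r < acc:
                state = j
                break
        steps += 1
    return steps

trials = 10000
for label, flag in [("one-sample", True), ("re-sample", False)]:
    mean = sum(steps_to_absorption(1, flag) for _ in range(trials)) / trials
    print(label, mean)
```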

Hitting times and mean hitting times. N_i is the random variable describing the number of steps required to hit state 0 conditional on starting in state i, and ν_i = E[N_i] is the expected number of steps needed to hit state 0 conditional on starting in state i.

Hitting times problem. We want to calculate an interval hitting times vector, [ν̲, ν̄], for our interval Markov chain. That is, we want to solve [ν̲, ν̄] = (I − P_s)^{-1} 1, where I is the identity matrix, 1 is the vector of ones, P_s is the sub-matrix of the interval matrix P, and ν̲ and ν̄ represent the lower and upper bounds of the hitting times vector.
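For a single point realisation of P_s this formula is a plain linear solve. A minimal numpy sketch with a hypothetical 2-state sub-matrix (not from the talk):

```python
import numpy as np

# Hypothetical point realisation of P_s over the transient states 1..n;
# the missing probability mass in each row goes to the absorbing state 0.
P_s = np.array([[0.5, 0.2],
                [0.3, 0.5]])

# Mean hitting times of state 0:  nu = (I - P_s)^{-1} 1
nu = np.linalg.solve(np.eye(len(P_s)) - P_s, np.ones(len(P_s)))
print(nu)   # expected number of steps to absorption from states 1 and 2
```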

Can we solve the system of equations directly? Can we just use INTLAB and interval arithmetic to solve the system of equations? INTLAB uses an iterative method to solve the system of equations. Problem: ensuring the same realisation of the interval matrix is chosen at each iteration. Problem: ensuring Σ_j P_ij = 1.

Hitting times interval. We seek to calculate the interval hitting times vector of an interval Markov chain by minimising and maximising the hitting times vector ν = (I − P_s)^{-1} 1, where

P_s =
  P_11  ...  P_1n
  ...   ...  ...
  P_n1  ...  P_nn

is a realisation of the interval P_s matrix with the row sums condition obeyed.

Maximisation case. We want to solve the following maximisation problem for k = 1, ..., n:

max ν_k = [(I − P_s)^{-1} 1]_k
subject to
  Σ_{j=0}^{n} P_ij = 1,  for i = 1, ..., n,
  P̲_ij ≤ P_ij ≤ P̄_ij,  for i = 1, ..., n; j = 0, ..., n.

New formulation of the problem.

max ν_k = [(I − P_s)^{-1} 1]_k
subject to
  Σ_{j=1}^{n} P_ij = 1 − P_i0,  for i = 1, ..., n,
  P̲_ij ≤ P_ij ≤ P̄_ij,  for i, j = 1, ..., n.

Feasible region of maximisation problem. The constraints are row-based. Let F_i be the feasible region of row i, for i = 1, ..., n; it represents the possible vectors for the i-th row of the P_s matrix. F_i is defined by bounds and linear constraints, which form a convex polytope (the convex hull of its vertices).

What can we do with this? Numerical experience suggests the optimal solution occurs at a vertex of the feasible region (a brute-force sketch over the vertices follows). We look to prove this conjecture using Markov decision processes (MDPs): we want to be able to represent our maximisation problem as an MDP and exploit existing MDP theory.
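A brute-force sketch of the conjecture, under the formulation above: enumerate the vertices of each row's feasible region F_i, build every P_s whose rows are vertices, and take the largest ν_k. The function names are ours, and exhaustive enumeration is only meant as an illustration for small matrices, not as the numerical method used in the talk.

```python
from itertools import product
import numpy as np

def row_vertices(lows, ups, s_low, s_high, tol=1e-9):
    """Vertices of F_i = { p : lows <= p <= ups, s_low <= sum(p) <= s_high }.
    At a vertex either every coordinate sits at a bound, or all but one do and
    the row-sum constraint is active."""
    n, verts = len(lows), []
    for pattern in product(*zip(lows, ups)):            # all coordinates at bounds
        if s_low - tol <= sum(pattern) <= s_high + tol:
            verts.append(np.array(pattern))
    for free in range(n):                               # one coordinate set by the sum
        others = [j for j in range(n) if j != free]
        for pattern in product(*[(lows[j], ups[j]) for j in others]):
            for target in (s_low, s_high):
                x = target - sum(pattern)
                if lows[free] - tol <= x <= ups[free] + tol:
                    v = np.empty(n)
                    v[others] = pattern
                    v[free] = x
                    verts.append(v)
    unique = []
    for v in verts:                                     # drop duplicates
        if not any(np.allclose(v, u) for u in unique):
            unique.append(v)
    return unique

def max_hitting_time(lower, upper, k):
    """Maximise nu_k over all vertex combinations of the rows of P_s.
    lower, upper: (n+1) x (n+1) interval bounds with state 0 absorbing."""
    n = len(lower) - 1
    per_row = [row_vertices(lower[i][1:], upper[i][1:],
                            1.0 - upper[i][0], 1.0 - lower[i][0])
               for i in range(1, n + 1)]
    best, best_P = -np.inf, None
    for rows in product(*per_row):
        P_s = np.vstack(rows)
        try:
            nu = np.linalg.solve(np.eye(n) - P_s, np.ones(n))
        except np.linalg.LinAlgError:                   # absorption impossible
            continue
        if nu[k - 1] > best:
            best, best_P = nu[k - 1], P_s
    return best, best_P
```

In principle this enumeration, applied to the counter-example interval matrix at the end of the talk, searches exactly the vertex combinations that the theorem below says are sufficient.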

What are Markov decision processes? A way to model decision making processes to optimise a pre-defined objective in a stochastic environment. Described by decision times, states, actions, rewards and transition probabilities. Optimised by decision rules and policies.

Mapping. Lemma: Our maximisation problem is a Markov decision process restricted to only consider Markovian decision rules and stationary policies. We prove this by representing our maximisation problem as an MDP.

Proof: states, decision times and rewards. States: both representations involve the same underlying Markov chain. Decision times: every time step of the underlying Markov chain; it is an infinite-horizon MDP as we allow the process to continue until absorption. Reward = 1: each step increases the time to absorption by one.

Proof: actions. Recall, F_i is the feasible region of row i. We choose to let each vertex of F_i correspond to an action of the MDP when in state i. To recover the full feasible region, we need convex combinations of vertices, i.e. convex combinations of actions.

Proof: transition probabilities. Let P_i^(a) be the probability distribution vector associated with an action a. When an action a is chosen in state i, the corresponding P_i^(a) is inserted into the i-th row of the matrix P_s. Considering all states i = 1, ..., n, we obtain the P_s matrix (a small value-iteration sketch of the resulting MDP follows).
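A minimal value-iteration sketch of this MDP (the tiny action sets below are hypothetical, invented for illustration): each transient state has one action per vertex of F_i, the action's row gives the transition probabilities among the transient states (the leftover mass is absorbed), every step earns reward 1, and the iteration converges to the maximal expected time to absorption together with a deterministic stationary policy.

```python
import numpy as np

# Hypothetical action sets: for each transient state, one candidate row of P_s
# per vertex of its feasible region F_i (entries are probabilities of moving
# to the transient states; the remainder goes to the absorbing state 0).
actions = {
    1: [np.array([0.5, 0.2]), np.array([0.7, 0.0])],
    2: [np.array([0.3, 0.5]), np.array([0.0, 0.8])],
}

states = sorted(actions)
V = np.zeros(len(states))          # V[i] ~ max expected steps to absorption
for _ in range(10000):
    # Bellman update: V(i) <- 1 + max_a sum_j P_i^(a)(j) V(j)
    V_new = np.array([1.0 + max(row @ V for row in actions[s]) for s in states])
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

# The greedy policy is deterministic and stationary: one vertex per row.
policy = {s: int(np.argmax([row @ V for row in actions[s]])) for s in states}
print(V, policy)
```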

Proof: Markovian decision rules and stationary policy. Markovian decision rules: the maximisation problem involves choosing the transition probabilities of a Markov chain. Stationary policy: we have a time homogeneous (one-sample) interval Markov chain, which means the optimal P_s matrix remains constant over time; hence the choice of decision rule is independent of time.

Optimal at vertex. Theorem: There exists an optimal solution of the maximisation problem where row i of the optimal matrix P_s represents a vertex of F_i, for all i = 1, ..., n. We need to show there is no extra benefit from having randomised decision rules as opposed to deterministic decision rules.

Why do we care about randomised and deterministic? Randomised decision rules correspond to convex combinations of actions, i.e. non-vertex points of F_i. Deterministic decision rules correspond to a single action, i.e. a vertex of F_i. We want deterministic decision rules!

Proof. Proposition (Proposition 6.2.1 of Puterman [1]): For all v ∈ V, sup_{d ∈ D^MD} {r_d + P_d v} = sup_{d ∈ D^MR} {r_d + P_d v}. This proposition from Puterman [1] gives us that there is nothing to be gained from randomised decision rules, so an optimal solution is obtained with deterministic decision rules. [1] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.

Conclusions. We have proven that an optimal solution occurs at a vertex of the feasible region. This theorem provides a useful analytic property which we can exploit when obtaining the optimal solution through numerical methods.

What else? Determine if interval analysis can be used to investigate model sensitivity: vary the width of the intervals for the parameters and see the effect on the mean hitting time intervals.


Counter-example for an analytic solution. Consider the following interval transition probability matrix:

P =
  [1, 1]        [0, 0]   [0, 0]    [0, 0]
  [0.3, 0.35]   [0, 1]   [0, 0]    [0, 0.1]
  [0.2, 0.3]    [0, 1]   [0, 1]    [0, 1]
  [0.1, 0.2]    [0, 1]   [0, 0.3]  [0, 0]

Counter-example for an analytic solution. Our proposed analytic solution:

P_s =
  0.6  0    0.1
  0    0    0.8
  0.6  0.3  0

Optimal solution obtained numerically from MATLAB:

P_s =
  0.6  0    0.1
  0    0.8  0
  0.6  0.3  0
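As a quick, independent check (a numpy sketch; it recomputes the hitting times rather than relying on the MATLAB run), one can evaluate ν = (I − P_s)^{-1} 1 for both matrices and compare the components:

```python
import numpy as np

candidates = {
    "proposed analytic": np.array([[0.6, 0.0, 0.1],
                                   [0.0, 0.0, 0.8],
                                   [0.6, 0.3, 0.0]]),
    "numerical optimum": np.array([[0.6, 0.0, 0.1],
                                   [0.0, 0.8, 0.0],
                                   [0.6, 0.3, 0.0]]),
}

for name, P_s in candidates.items():
    nu = np.linalg.solve(np.eye(3) - P_s, np.ones(3))
    print(name, nu)   # mean hitting times of state 0 from states 1, 2, 3
```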