Markov Decision Processes

Lecture notes for the course "Games on Graphs"
B. Srivathsan, Chennai Mathematical Institute, India

1 Markov Chains

We will define Markov chains in a manner that will be useful to study simple stochastic games. The usual definition of Markov chains is more general.

Definition 1 (Markov Chains) A Markov chain G is a graph (V_avg, E). The vertex set V_avg is of the form {1, 2, ..., n-1, n}. Vertex n-1 is called the 0-sink and vertex n is called the 1-sink. From every vertex, there are two outgoing edges, each with probability 1/2. Note that both outgoing edges can lead to the same vertex. Only the edges from n-1 and n are self loops. A Markov chain on n vertices can be specified by its n x n transition matrix, each entry of which is 0, 1/2 or 1. A Markov chain is said to be stopping if from every vertex there is a path to a sink vertex.

Definition 2 (Reachability probabilities) For a stopping Markov chain G, let v := (v_1, v_2, ..., v_n) denote the probabilities to reach the 1-sink n from each vertex. Then v satisfies the following constraints:

    v_n = 1
    v_{n-1} = 0
    v_i = (1/2) v_j + (1/2) v_k    for each 1 ≤ i ≤ n-2, where j and k are the children of i

Let us restrict v to the first n-2 coordinates. If v is seen as a column vector, then the above system of equations can be written as v = Qv + b, where Q is the (n-2) x (n-2) transition matrix of G restricted to the non-sink vertices (not containing the sink states n-1 and n) and b is a column vector with entries 0, 1/2 or 1 (giving the one-step probability to reach the 1-sink). A solution to this equation gives the reachability probabilities.
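
To make Definition 2 concrete, here is a minimal sketch in Python (assuming numpy is available; the chain below is a hypothetical example, not one from the course) that builds Q and b for a small stopping chain and solves v = Qv + b.

    import numpy as np

    # Hypothetical 5-vertex stopping chain: vertices 0..2 are ordinary vertices,
    # vertex 3 is the 0-sink and vertex 4 is the 1-sink (0-indexed).
    # children[i] = (j, k): the two successors of vertex i, each taken with probability 1/2.
    children = {0: (1, 2), 1: (3, 4), 2: (4, 4)}
    n = 5

    Q = np.zeros((n - 2, n - 2))   # transitions among non-sink vertices
    b = np.zeros(n - 2)            # one-step probability of reaching the 1-sink
    for i, (j, k) in children.items():
        for c in (j, k):
            if c == n - 1:         # 1-sink
                b[i] += 0.5
            elif c == n - 2:       # 0-sink contributes nothing
                pass
            else:
                Q[i, c] += 0.5

    # Reachability probabilities: v = Qv + b  <=>  (I - Q) v = b  (Definition 2).
    v = np.linalg.solve(np.eye(n - 2) - Q, b)
    print(v)                       # probability of reaching the 1-sink from vertices 0, 1, 2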

Here are some results about Markov chains.

Lemma 3 Let Q be the transition matrix of a stopping Markov chain (restricted to the non-sink vertices). Then lim_{n→∞} Q^n = 0.

Proof: See Theorem 11.3 (page 417) of [GS].

Lemma 4 Let Q be the transition matrix of a stopping Markov chain. Then I - Q is invertible. Moreover, (I - Q)^{-1} = I + Q + Q^2 + ....

Proof: See Theorem 11.4 (page 418) of [GS].

Theorem 5 For a stopping Markov chain G, the system of equations v = Qv + b in Definition 2 has a unique solution, given by v = (I - Q)^{-1} b.

Proof: Follows from Lemma 4.

2 Markov Decision Processes

Definition 6 (Markov Decision Process) A Markov Decision Process (MDP) G is a graph (V_avg ∪ V_max, E). The vertex set is of the form {1, 2, ..., n-1, n}. From every vertex, there are two outgoing edges. Note that both outgoing edges can lead to the same vertex. Each outgoing edge from a vertex in V_avg is marked with probability 1/2. Vertex n-1 is called the 0-sink and vertex n is called the 1-sink. Only the edges from n-1 and n are self loops. For convenience, we will call vertices in V_max max vertices and vertices in V_avg average vertices.

Definition 7 (Strategy/Policy) A strategy σ in an MDP is a function σ : V_max → V_max ∪ V_avg which chooses an outgoing edge for every max vertex. Strategies are also called policies.

Observe that we have restricted to positional strategies in the above definition. Other kinds of strategies for this model are out of the scope of this course. Each strategy σ for G gives a Markov chain G_σ. An MDP is said to be stopping if for every σ, the Markov chain G_σ is stopping. Corresponding to G_σ is the vector v_σ of reachability probabilities given in Definition 2.

Definition 8 (Value vector) For an MDP, we define its value vector as:

    v := ( max_σ v_σ(1), max_σ v_σ(2), ..., max_σ v_σ(n) )^T

In the following, we will give different methods to compute the value vector of an MDP.
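
Continuing the sketch above, the following Python fragment fixes one possible representation of a small MDP (the names children, is_max and value_of_strategy are assumptions for illustration, not notation from the notes) and computes v_σ for a given positional strategy σ by building the induced chain G_σ of Definition 7 and applying Theorem 5.

    import numpy as np

    # Hypothetical 5-vertex MDP: vertex 3 is the 0-sink, vertex 4 is the 1-sink.
    # children[i] = (j, k) lists the two successors of vertex i,
    # and is_max[i] says whether i is a max vertex or an average vertex.
    children = {0: (1, 2), 1: (3, 4), 2: (0, 4)}
    is_max = {0: True, 1: False, 2: False}
    n = 5

    def value_of_strategy(sigma):
        """Reachability probabilities v_sigma of the induced Markov chain G_sigma."""
        Q = np.zeros((n - 2, n - 2))
        b = np.zeros(n - 2)
        for i, (j, k) in children.items():
            # A max vertex follows its strategy edge with probability 1;
            # an average vertex takes each child with probability 1/2.
            succs = [(sigma[i], 1.0)] if is_max[i] else [(j, 0.5), (k, 0.5)]
            for c, p in succs:
                if c == n - 1:
                    b[i] += p          # one step to the 1-sink
                elif c != n - 2:
                    Q[i, c] += p
        return np.linalg.solve(np.eye(n - 2) - Q, b)   # v = (I - Q)^{-1} b

    print(value_of_strategy({0: 1}))   # strategy: max vertex 0 chooses child 1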

2.1 Characterizing the value vector using constraints

Similar to the constraints for Markov chains given in Definition 2, we will now give constraints for MDPs. Given an MDP G, consider the following set of constraints over the variables w_1, w_2, ..., w_n:

    0 ≤ w_i ≤ 1                    for every i                                                    (1.1)
    w_n = 1
    w_{n-1} = 0
    w_i = max(w_j, w_k)            for each i ≤ n-2 in V_max, where j and k are the children of i
    w_i = (1/2) w_j + (1/2) w_k    for each i ≤ n-2 in V_avg, where j and k are the children of i

Theorem 9 For a stopping MDP, its value vector v is the unique solution to (1.1).

Before proving the above theorem, let us mention that just from the definition of v, its components need not be related: v(1) and v(2) can arise out of different strategies. But Theorem 9 relates all these values. Therefore, in order to prove the theorem, it will be convenient to have a link between the components of v: we will show that the whole of v can be obtained using special types of strategies.

Definition 10 (Optimal strategies) A strategy σ for an MDP is said to be optimal if for every max vertex i, the value v_σ(i) equals max(v_σ(j), v_σ(k)), where j and k are its children.

Lemma 11 For every optimal strategy ρ, the vector v_ρ equals the value vector v of the MDP.

Proof: Clearly, v_ρ ≤ v. We will now show that v_σ ≤ v_ρ for every strategy σ; this will prove the lemma. Consider an arbitrary strategy σ. The value vector v_σ is obtained as the solution to the equation v_σ = Q_σ v_σ + b_σ, so v_σ = (I - Q_σ)^{-1} b_σ. We first show that v_ρ ≥ Q_σ v_ρ + b_σ. For an average vertex i, the left hand side gives v_ρ(i) and the right hand side gives (1/2) v_ρ(j) + (1/2) v_ρ(k); hence both are equal. For a max vertex i, the left hand side gives max(v_ρ(j), v_ρ(k)) (since ρ is optimal) and the right hand side gives either v_ρ(j) or v_ρ(k) (assuming that j and k are the children of i). This shows that v_ρ ≥ Q_σ v_ρ + b_σ. Rearranging and multiplying by (I - Q_σ)^{-1}, which has non-negative entries, we get v_ρ ≥ (I - Q_σ)^{-1} b_σ, which gives v_ρ ≥ v_σ.

Proof of Theorem 9: Each solution to (1.1) corresponds to an optimal strategy. From Lemma 11, every optimal strategy gives the unique value vector of the MDP. This shows that (1.1) cannot have multiple solutions. To show that there is a solution, it is sufficient to prove the existence of an optimal strategy. This is shown in Theorem 13 in the next section.
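
Definition 10 can be checked directly on the toy representation used in the earlier sketches. The helper below (a sketch reusing the assumed names children, is_max and value_of_strategy from above) tests whether a strategy is optimal; this is exactly the termination condition of the algorithm in the next section.

    def is_optimal(sigma, tol=1e-9):
        """Check Definition 10: at every max vertex, the strategy's value
        already equals the best value among its two children."""
        v = value_of_strategy(sigma)
        # extend v with the sink values so children that are sinks can be looked up
        full = np.concatenate([v, [0.0, 1.0]])   # 0-sink has value 0, 1-sink has value 1
        for i, (j, k) in children.items():
            if is_max[i] and full[i] < max(full[j], full[k]) - tol:
                return False
        return True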

Algorithm 1.1: Strategy improvement algorithm for MDPs, also known as policy iteration

    algorithm strategy_improvement(G)
        σ ← an arbitrary positional strategy
        v_σ ← probabilities to reach the 1-sink in G_σ
        repeat
            pick a node i ∈ V_max such that v_σ(i) < max(v_σ(j), v_σ(k)), where j, k are its children
            σ'(i) ← arg max {v_σ(j), v_σ(k)}, and σ'(l) ← σ(l) for every other max vertex l
            σ ← σ'
            v_σ ← probabilities to reach the 1-sink in G_σ
        until σ is optimal

2.2 Strategy improvement / policy iteration

We will now give an algorithm to compute an optimal strategy for an MDP. It starts with an arbitrary initial strategy and, over repeated iterations, converges to an optimal one (refer to Algorithm 1.1). It is clear that if the algorithm terminates, it computes an optimal strategy. We will now show termination. Let us denote the strategy obtained after the p-th iteration by σ_p and the corresponding value vector by v_p.

Lemma 12 The values satisfy v_p ≤ v_{p+1}, and there is at least one vertex where the inequality is strict.

Proof: Strategy σ_p induces a Markov chain G_p. The values v_p are obtained as the solution to the equations v_p = Q_p v_p + b_p, as given in Definition 2. Similarly, v_{p+1} = Q_{p+1} v_{p+1} + b_{p+1}. Note that σ_{p+1} is obtained from σ_p by modifying the strategy at a single vertex. Therefore, the matrices Q_p and Q_{p+1} can differ in at most one row i. Without loss of generality, let us say that Q_p(i, j) = 1 and Q_{p+1}(i, k) = 1, with the rest of the entries in row i being 0. Moreover, v_p(k) - v_p(j) > 0. The case where the change of strategy modifies the b vector can be handled similarly. Let us now look at v_{p+1} - v_p:

    v_{p+1} - v_p = Q_{p+1} v_{p+1} - Q_p v_p                                  (as b_p = b_{p+1})
    v_{p+1} - v_p = Q_{p+1} v_{p+1} - Q_{p+1} v_p + Q_{p+1} v_p - Q_p v_p
    (I - Q_{p+1})(v_{p+1} - v_p) = (Q_{p+1} - Q_p) v_p
    (I - Q_{p+1})(v_{p+1} - v_p) = (0, ..., 0, v_p(k) - v_p(j), 0, ..., 0)^T

In the last line, the quantity v_p(k) - v_p(j) appears in the i-th coordinate, and we know it is strictly bigger than 0. Since G_{p+1} is stopping, Lemma 4 gives that I - Q_{p+1} is invertible. Therefore, v_{p+1} - v_p = (I - Q_{p+1})^{-1} (0, ..., 0, v_p(k) - v_p(j), 0, ..., 0)^T. From Lemma 4, we also know that the inverse has all non-negative entries and that its diagonal entries are strictly positive. This shows that v_{p+1} ≥ v_p, with the inequality strict in at least one component.

Theorem 13 The strategy improvement algorithm terminates with an optimal strategy.

Proof: From Lemma 12, we can infer that after each iteration a strategy which has not been seen before is obtained. Since the number of strategies is finite, the algorithm terminates. Moreover, the termination condition says that the final strategy obtained is optimal.
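
As an illustration, here is a sketch of Algorithm 1.1 on the toy MDP from the earlier sketches (reusing the assumed names children, is_max and value_of_strategy); it is one possible way to realize the pseudocode, not the course's reference implementation.

    def strategy_improvement():
        """A sketch of Algorithm 1.1 on the toy MDP above: repeatedly switch one
        max vertex to a strictly better child until no improvement is possible."""
        # start from an arbitrary positional strategy: every max vertex picks its first child
        sigma = {i: children[i][0] for i in children if is_max[i]}
        while True:
            v = value_of_strategy(sigma)
            full = np.concatenate([v, [0.0, 1.0]])   # append 0-sink and 1-sink values
            improved = False
            for i, (j, k) in children.items():
                if is_max[i] and full[i] < max(full[j], full[k]) - 1e-9:
                    sigma[i] = j if full[j] >= full[k] else k   # switch to the better child
                    improved = True
                    break                                       # modify a single vertex per iteration
            if not improved:                                    # sigma is optimal
                return sigma, full[:-2]

    sigma_opt, v_opt = strategy_improvement()
    print(sigma_opt, v_opt)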

2.3 Linear Programming

Definition 14 (Linear Program for an MDP) For an MDP G, we associate the following linear program:

    Minimize  Σ_i x_i  subject to the constraints

    0 ≤ x_i ≤ 1                    for every i
    x_n = 1
    x_{n-1} = 0
    x_i ≥ x_j  and  x_i ≥ x_k      for each i ≤ n-2 in V_max, where j and k are the children of i
    x_i = (1/2) x_j + (1/2) x_k    for each i ≤ n-2 in V_avg, where j and k are the children of i

Theorem 15 For a stopping MDP, the linear program of Definition 14 has a unique optimal solution, and this solution gives the value vector v of the MDP.

Proof: From Theorem 9, the value vector v is one feasible solution to the constraints of the above LP. We will now show that for any other feasible solution x of the LP we have v ≤ x. This will imply that v is the solution which minimizes the sum, and hence is the unique optimal solution to the LP.

Let x be a feasible solution. Suppose there are some vertices where v is strictly bigger than x. Let

    U = { i : 1 ≤ i ≤ n, v_i - x_i > 0 and v_i - x_i is maximum }.

The set U consists of the vertices where the difference between v and x is maximum. We will now show a property of this set. Suppose i ∈ V_max belongs to U. Since v is tight (that is, it satisfies (1.1)), we have v_i = v_j for some child j of i. Also, as x is a feasible solution, x_i ≥ x_j. As i belongs to U, we have v_i - x_i ≥ v_j - x_j, and substituting v_i = v_j gives x_i ≤ x_j. Combining this with the previous inequality gives x_i = x_j and hence v_i - x_i = v_j - x_j. This shows that j belongs to U as well. Similarly, one can show that if i ∈ V_avg belongs to U, both its children j and k belong to U. Since G is stopping, this property entails that if U is non-empty, then either the 0-sink or the 1-sink belongs to U. But for the sink vertices, both v and x assign the same value. This contradicts the assumption that there are some vertices where v is strictly bigger than x.
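
For the same toy MDP, the linear program of Definition 14 can be handed to an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog (assumed to be available) together with the children / is_max representation from the earlier sketches; its optimal solution should coincide with the value vector.

    from scipy.optimize import linprog
    import numpy as np

    # LP of Definition 14 for the toy MDP above.
    # Variables x_0 .. x_4; vertex 3 is the 0-sink, vertex 4 is the 1-sink.
    c = np.ones(n)                         # minimize sum_i x_i

    A_eq, b_eq = [], []
    A_ub, b_ub = [], []

    row = np.zeros(n); row[n - 1] = 1; A_eq.append(row); b_eq.append(1.0)   # x_n = 1
    row = np.zeros(n); row[n - 2] = 1; A_eq.append(row); b_eq.append(0.0)   # x_{n-1} = 0

    for i, (j, k) in children.items():
        if is_max[i]:
            for ch in (j, k):              # x_i >= x_ch  <=>  x_ch - x_i <= 0
                row = np.zeros(n); row[ch] += 1; row[i] -= 1
                A_ub.append(row); b_ub.append(0.0)
        else:                              # x_i = (1/2) x_j + (1/2) x_k
            row = np.zeros(n); row[i] += 1; row[j] -= 0.5; row[k] -= 0.5
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * n)
    print(res.x)                           # should coincide with the value vector of the MDP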

Algorithm 1.2: Value iteration algorithm, also known as the successive approximation algorithm

    algorithm value_iteration(G)
        let u be the vector assigning 1 to the 1-sink and 0 to everything else
        repeat
            define u' as follows:
                u'(i) = max(u(j), u(k)),          if i is a max node and j, k are its children
                u'(i) = (1/2) u(j) + (1/2) u(k),  if i is an average node and j, k are its children
                u'(n-1) = 0
                u'(n) = 1
            u ← u'
        until u satisfies (1.1)

2.4 Value iteration

Algorithm 1.2 gives yet another method to compute the value vector. This method need not terminate; it converges to the value vector in the limit. Let v_m be the vector obtained after m iterations. The procedure maintains the following invariant:

    v_m(i) = max(v_{m-1}(j), v_{m-1}(k))            if i ∈ V_max with children j, k        (1.2)
    v_m(i) = (1/2) v_{m-1}(j) + (1/2) v_{m-1}(k)    if i ∈ V_avg with children j, k

Theorem 16 For a stopping MDP, the sequence v_m converges to the value vector.

Proof: We first show that the sequence v_m converges. We can prove by induction that v_m is monotonically increasing (at each vertex). It is also bounded above by 1. Therefore v_m converges, say to v*. From (1.2), we get:

    lim_{m→∞} v_m(i) = lim_{m→∞} max(v_{m-1}(j), v_{m-1}(k))              if i ∈ V_max with children j, k        (1.3)
    lim_{m→∞} v_m(i) = lim_{m→∞} ((1/2) v_{m-1}(j) + (1/2) v_{m-1}(k))    if i ∈ V_avg with children j, k

From the fact that the sequence v_m converges to v* and from (1.3), we get:

    v*(i) = max(v*(j), v*(k))              if i ∈ V_max with children j, k
    v*(i) = (1/2) v*(j) + (1/2) v*(k)      if i ∈ V_avg with children j, k

Moreover, every v_m assigns 0 to the 0-sink and 1 to the 1-sink, so v* is 0 and 1 on the 0-sink and 1-sink respectively. Hence v* is a solution to the system of constraints (1.1). From Theorem 9, v* is the value vector of the MDP.

(The proof of this theorem written in this document was given by one of the students, J. Kishor, B.Sc. Maths and C.S., 3rd year.)
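
Finally, here is a sketch of Algorithm 1.2 on the same toy MDP (again reusing the assumed children / is_max names). Since the method need not terminate exactly, the sketch simply runs a fixed number of rounds rather than testing (1.1), which approximates the value vector from below.

    def value_iteration(rounds=100):
        """A sketch of Algorithm 1.2 on the toy MDP above: iterate the max /
        average updates starting from 0 everywhere except the 1-sink."""
        u = np.zeros(n)
        u[n - 1] = 1.0                     # 1-sink
        for _ in range(rounds):            # fixed number of rounds instead of a convergence test
            new = u.copy()
            for i, (j, k) in children.items():
                new[i] = max(u[j], u[k]) if is_max[i] else 0.5 * u[j] + 0.5 * u[k]
            u = new
        return u[:-2]                      # values of the non-sink vertices

    print(value_iteration())               # approaches the value vector from below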

Intuitively, we can think of the values in v_m as the maximum probability of reaching the 1-sink in at most m steps (using any strategy, which need not necessarily be positional). In the limit, this sequence converges to the probability of reaching the 1-sink. The above theorem says that this value equals the value vector of the MDP as in Definition 8, which we know can be obtained using positional strategies. This therefore justifies that for stopping MDPs, the reachability probabilities can be obtained using positional strategies.

References

[GS] Charles M. Grinstead and J. Laurie Snell. Introduction to Probability.