Markov Decision Processes


Definitions; stationary policies; value iteration, policy improvement, and linear programming for the discounted cost and average cost criteria.

Markov Decision Process

Let X = {X_0, X_1, ...} be a system description process on state space E, and let D = {D_0, D_1, ...} be a decision process with action space A. The process (X, D) is a Markov decision process if, for j ∈ E and n = 0, 1, ...,

    P{X_{n+1} = j | X_0, D_0, ..., X_n, D_n} = P{X_{n+1} = j | X_n, D_n}

Furthermore, for each k ∈ A, let f_k be a cost vector and P_k be a one-step transition probability matrix. The cost f_k(i) is incurred whenever X_n = i and D_n = k, and

    P{X_{n+1} = j | X_n = i, D_n = k} = P_k(i, j)

The problem is to determine how to choose a sequence of actions in order to minimize cost.

Policies

A policy is a rule that specifies which action to take at each point in time. Let D denote the set of all policies. In general, the decisions specified by a policy may
- depend on the current state of the system description process,
- be randomized (depend on some external random event), and
- depend on past states and/or decisions.

A stationary policy is defined by a (deterministic) action function that assigns an action to each state, independent of previous states, previous actions, and time. Under a stationary policy, the MDP is a Markov chain.

Cost Minimization Criteria

Since an MDP goes on indefinitely, it is likely that the total cost will be infinite. In order to meaningfully compare policies, two criteria are commonly used:

1. The expected total discounted cost computes the present worth of future costs using a discount factor α < 1, such that one dollar obtained at time n = 1 has a present value of α at time n = 0. Typically, if r is the rate of return, then α = 1/(1 + r). The expected total discounted cost is

    E[ Σ_{n=0}^∞ α^n f_{D_n}(X_n) ]

2. The long run average cost is

    lim_{m→∞} (1/m) Σ_{n=0}^{m-1} f_{D_n}(X_n)

Optimization with Stationary Policies

If the state space E is finite, there exists a stationary policy that solves the problem to minimize the discounted cost:

    v_α(i) = min_{d∈D} v_{α,d}(i), where v_{α,d}(i) = E[ Σ_{n=0}^∞ α^n f_{d(X_n)}(X_n) | X_0 = i ]

If every stationary policy results in an irreducible Markov chain, there exists a stationary policy that solves the problem to minimize the average cost:

    ϕ* = min_{d∈D} ϕ_d, where ϕ_d = lim_{m→∞} (1/m) Σ_{n=0}^{m-1} f_{d(X_n)}(X_n)

Computing Expected Discounted Costs

Let X = {X_0, X_1, ...} be a Markov chain with one-step transition probability matrix P, let f be a cost function that assigns a cost to each state of the Markov chain, and let α (0 < α < 1) be a discount factor. Then the expected total discounted cost is

    g(i) = E[ Σ_{n=0}^∞ α^n f(X_n) | X_0 = i ] = [(I − αP)^{-1} f](i)

Why? Starting from state i, the expected discounted cost can be found recursively as

    g(i) = f(i) + α Σ_j P(i, j) g(j),  or  g = f + αPg

Note that the expected discounted cost always depends on the initial state, while for the average cost criterion the initial state is unimportant.
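As a minimal numerical sketch, g can be computed by solving the linear system (I − αP) g = f. The two-state chain, costs, and discount factor below are invented example data, not from the slides:

```python
import numpy as np

# Hypothetical two-state Markov chain (example data, not from the slides)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # one-step transition probabilities
f = np.array([1.0, 3.0])     # cost incurred in each state
alpha = 0.9                  # discount factor

# g = (I - alpha P)^{-1} f, computed by solving the linear system
g = np.linalg.solve(np.eye(2) - alpha * P, f)

# Sanity check: g satisfies the recursion g = f + alpha P g
assert np.allclose(g, f + alpha * P @ g)
print(g)
```

Solving the system directly is preferable to forming the matrix inverse, especially for larger state spaces.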

Solution Procedures for Discounted Costs

Let v_α be the (vector) optimal value function whose ith component is

    v_α(i) = min_{d∈D} v_{α,d}(i)

For each i ∈ E,

    v_α(i) = min_{k∈A} { f_k(i) + α Σ_{j∈E} P_k(i, j) v_α(j) }

These equations uniquely determine v_α. If we can somehow obtain the values v_α that satisfy the above equations, then the optimal policy is the vector a*, where

    a*(i) = arg min_{k∈A} { f_k(i) + α Σ_{j∈E} P_k(i, j) v_α(j) }

(arg min is the argument that minimizes)

Value Iteration for Discounted Costs

Make a guess, then keep applying the optimal value equations until the fixed point is reached.

Step 1. Choose ε > 0, set n = 0, and let v_0(i) = 0 for each i ∈ E.
Step 2. For each i ∈ E, find v_{n+1}(i) as

    v_{n+1}(i) = min_{k∈A} { f_k(i) + α Σ_{j∈E} P_k(i, j) v_n(j) }

Step 3. Let δ = max_{i∈E} |v_{n+1}(i) − v_n(i)|.
Step 4. If δ < ε, stop with v_α ≈ v_{n+1}. Otherwise, set n = n + 1 and return to Step 2.
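The four steps above can be sketched in a few lines of NumPy. The 2-state, 2-action MDP (a "stay" action and a "switch" action with made-up costs) is invented example data, not from the slides:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (example data, not from the slides).
# P[k] is the transition matrix under action k; f[k] is its cost vector.
P = np.array([[[1.0, 0.0],    # action 0: stay in the current state
               [0.0, 1.0]],
              [[0.0, 1.0],    # action 1: switch states
               [1.0, 0.0]]])
f = np.array([[1.0, 3.0],     # f_0(i): cost of "stay" in state i
              [2.0, 4.0]])    # f_1(i): cost of "switch" in state i
alpha, eps = 0.9, 1e-8

v = np.zeros(2)               # Step 1: v_0 = 0
while True:
    # Step 2: Q[k, i] = f_k(i) + alpha * sum_j P_k(i, j) v(j)
    Q = f + alpha * P @ v
    v_next = Q.min(axis=0)
    # Steps 3-4: stop when successive iterates are within eps
    if np.max(np.abs(v_next - v)) < eps:
        v = v_next
        break
    v = v_next

policy = (f + alpha * P @ v).argmin(axis=0)
print(v, policy)
```

For this data the iteration settles on staying in state 0 and switching out of state 1, with v ≈ (10, 13).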

Policy Improvement for Discounted Costs

Start myopic, then consider longer-term consequences.

Step 1. Set n = 0 and let a_0(i) = arg min_{k∈A} f_k(i).
Step 2. Adopt the cost vector and transition matrix of the current policy:

    f(i) = f_{a_n(i)}(i),  P(i, j) = P_{a_n(i)}(i, j)

Step 3. Find the value function v = (I − αP)^{-1} f.
Step 4. Re-optimize:

    a_{n+1}(i) = arg min_{k∈A} { f_k(i) + α Σ_{j∈E} P_k(i, j) v(j) }

Step 5. If a_{n+1}(i) = a_n(i) for all i, then stop with v_α = v and a*(i) = a_n(i). Otherwise, set n = n + 1 and return to Step 2.
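A compact sketch of these five steps, using the same kind of invented 2-state, 2-action data as before (not from the slides):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (example data, not from the slides).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch
f = np.array([[1.0, 3.0],                 # f_0(i)
              [2.0, 4.0]])                # f_1(i)
alpha = 0.9
n_states = 2

a = f.argmin(axis=0)                      # Step 1: myopic policy
while True:
    # Step 2: cost vector and transition matrix under policy a
    f_a = f[a, np.arange(n_states)]
    P_a = P[a, np.arange(n_states)]
    # Step 3: evaluate the policy, v = (I - alpha P)^{-1} f
    v = np.linalg.solve(np.eye(n_states) - alpha * P_a, f_a)
    # Step 4: re-optimize against v
    a_next = (f + alpha * P @ v).argmin(axis=0)
    # Step 5: stop when the policy no longer changes
    if np.array_equal(a_next, a):
        break
    a = a_next

print(a, v)
```

On this data the myopic policy stays in both states; one improvement step fixes state 1, after which the policy is stable with v = (10, 13), agreeing with value iteration.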

Linear Programming for Discounted Costs

Consider the linear program:

    max Σ_{i∈E} u(i)
    s.t. u(i) ≤ f_k(i) + α Σ_{j∈E} P_k(i, j) u(j)  for each i, k

The optimal value of u(i) will be v_α(i), and the optimal policy is identified by the constraints that hold as equalities in the optimal solution (slack variables equal 0).

Note: the decision variables are unrestricted in sign!
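A minimal sketch with SciPy, again on invented 2-state, 2-action data (not from the slides). Since linprog minimizes, maximizing Σ u(i) becomes minimizing its negative, and the bounds are opened so u is free in sign:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP (example data, not from the slides).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])  # action 1: switch
f = np.array([[1.0, 3.0],
              [2.0, 4.0]])
alpha = 0.9
n_states, n_actions = 2, 2

# One constraint per (i, k):  u(i) - alpha * sum_j P_k(i, j) u(j) <= f_k(i)
A_ub, b_ub = [], []
for i in range(n_states):
    for k in range(n_actions):
        row = -alpha * P[k, i]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(f[k, i])

# Maximize sum_i u(i)  <=>  minimize -sum_i u(i); u unrestricted in sign.
res = linprog(c=-np.ones(n_states), A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * n_states)
u = res.x
print(u)
```

The solution reproduces u = (10, 13), and the constraints that are tight at the optimum identify the same stay/switch policy found by the other two methods.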

Long Run Average Cost per Period

For a given policy d, its long run average cost can be found from its cost vector f_d and one-step transition probability matrix P_d. First, find the limiting probabilities π by solving

    π_j = Σ_{i∈E} π_i P_d(i, j), j ∈ E;  Σ_{j∈E} π_j = 1

Then

    ϕ_d = lim_{m→∞} (1/m) E[ Σ_{n=0}^{m-1} f_d(X_n) ] = Σ_{j∈E} f_d(j) π_j

So, in principle, we could simply enumerate all policies and choose the one with the smallest average cost. This is not practical if A and E are large.
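The limiting probabilities and the resulting average cost can be computed for one fixed policy as follows. The irreducible two-state chain below is invented example data, not from the slides:

```python
import numpy as np

# Hypothetical irreducible 2-state chain for one fixed policy
# (example data, not from the slides).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
f = np.array([1.0, 3.0])

# Solve pi = pi P together with sum(pi) = 1: take (P^T - I) pi = 0
# and replace the last (redundant) balance equation with the normalization.
A = P.T - np.eye(2)
A[-1, :] = 1.0
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

phi = pi @ f   # long run average cost per period
print(pi, phi)
```

For this chain π = (4/7, 3/7), giving ϕ_d = 13/7 per period.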

Recursive Equation for Average Cost

Assume that every stationary policy yields an irreducible Markov chain. Then there exist a scalar ϕ* and a vector h such that for all states i ∈ E,

    ϕ* + h(i) = min_{k∈A} { f_k(i) + Σ_{j∈E} P_k(i, j) h(j) }

The scalar ϕ* is the optimal average cost, and the optimal policy is found by choosing for each state the action that achieves the minimum on the right-hand side.

The vector h is unique up to an additive constant; as we will see, the difference h(i) − h(j) represents the increase in total cost from starting out in state i rather than in state j.

Relationships between Discounted Cost and Long Run Average Cost

If a cost of c is incurred each period and α is the discount factor, then the total discounted cost is

    v = Σ_{n=0}^∞ α^n c = c / (1 − α)

Therefore, a total discounted cost v is equivalent to an average cost of c = (1 − α)v per period, so

    lim_{α→1} (1 − α) v_α(i) = ϕ*

Let v_α be the optimal discounted cost vector, ϕ* be the optimal average cost, and h be the mystery vector from the previous slide. Then

    lim_{α→1} [ v_α(i) − v_α(j) ] = h(i) − h(j)

Policy Improvement for Average Costs

Designate one state in E to be state number 1.

Step 1. Set n = 0 and let a_0(i) = arg min_{k∈A} f_k(i).
Step 2. Adopt the cost vector and transition matrix of the current policy:

    f(i) = f_{a_n(i)}(i),  P(i, j) = P_{a_n(i)}(i, j)

Step 3. With h(1) = 0, solve ϕ + h = f + Ph.
Step 4. Re-optimize:

    a_{n+1}(i) = arg min_{k∈A} { f_k(i) + Σ_{j∈E} P_k(i, j) h(j) }

Step 5. If a_{n+1}(i) = a_n(i) for all i, then stop with ϕ* = ϕ and a*(i) = a_n(i). Otherwise, set n = n + 1 and return to Step 2.
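A sketch of this procedure on invented 2-state, 2-action data (not from the slides). The designated state is index 0 here, playing the role of "state number 1" on the slide; Step 3 becomes a linear solve for the unknowns (ϕ, h(1), ..., h(n−1)):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (example data, not from the slides).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # action 0
              [[0.2, 0.8], [0.9, 0.1]]])  # action 1
f = np.array([[1.0, 3.0],                 # f_0(i)
              [4.0, 3.5]])                # f_1(i)
n_states = 2

def evaluate(P_a, f_a):
    """Step 3: solve phi + h = f + P h with h(0) = 0 (the designated state)."""
    n = len(f_a)
    M = np.zeros((n, n))
    M[:, 0] = 1.0                             # coefficient of phi
    M[:, 1:] = np.eye(n)[:, 1:] - P_a[:, 1:]  # coefficients of h(1..n-1)
    z = np.linalg.solve(M, f_a)
    return z[0], np.concatenate(([0.0], z[1:]))

a = f.argmin(axis=0)                          # Step 1: myopic policy
while True:
    idx = np.arange(n_states)
    phi, h = evaluate(P[a, idx], f[a, idx])   # Steps 2-3
    a_next = (f + P @ h).argmin(axis=0)       # Step 4: re-optimize
    if np.array_equal(a_next, a):             # Step 5
        break
    a = a_next

print(a, phi)
```

On this data the myopic policy uses action 0 everywhere; one improvement step switches state 1 to action 1, and the procedure stops with ϕ* = 1.625.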

Linear Programming for Average Costs

Consider randomized policies: let w_i(k) = P{D_n = k | X_n = i}. A stationary policy has w_i(k) = 1 for k = a(i) and 0 otherwise. The decision variables are x(i, k) = w_i(k) π(i). The objective is to minimize the expected value of the average cost (expectation taken over the randomized policy):

    min ϕ = Σ_{i∈E} Σ_{k∈A} x(i, k) f_k(i)
    s.t. Σ_{k∈A} x(j, k) = Σ_{i∈E} Σ_{k∈A} x(i, k) P_k(i, j)  for each j ∈ E
         Σ_{i∈E} Σ_{k∈A} x(i, k) = 1
         x(i, k) ≥ 0

Note that one balance constraint will be redundant and may be dropped.
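This LP can be sketched with SciPy on the same invented 2-state, 2-action data used above for average-cost policy improvement (not from the slides); the first balance constraint is dropped as redundant:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP (example data, not from the slides).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],   # action 0
              [[0.2, 0.8], [0.9, 0.1]]])  # action 1
f = np.array([[1.0, 3.0],
              [4.0, 3.5]])
n_s, n_a = 2, 2

# Variables x(i, k), flattened so column i*n_a + k corresponds to (i, k).
c = np.array([f[k, i] for i in range(n_s) for k in range(n_a)])

# Balance constraints, one per state j (j = 0 dropped as redundant):
# sum_k x(j,k) - sum_{i,k} x(i,k) P_k(i,j) = 0
A_eq, b_eq = [], []
for j in range(1, n_s):
    row = np.zeros(n_s * n_a)
    for i in range(n_s):
        for k in range(n_a):
            row[i * n_a + k] -= P[k, i, j]
    for k in range(n_a):
        row[j * n_a + k] += 1.0
    A_eq.append(row)
    b_eq.append(0.0)

# Normalization: the x(i, k) sum to 1; x >= 0 via the default-style bounds.
A_eq.append(np.ones(n_s * n_a))
b_eq.append(1.0)

res = linprog(c=c, A_eq=np.array(A_eq), b_eq=b_eq,
              bounds=[(0, None)] * (n_s * n_a))
phi = res.fun
print(phi, res.x)
```

The optimal objective matches the ϕ* = 1.625 found by policy improvement, and the positive x(i, k) entries recover the same stationary policy.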