Module 6: Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo

Markov Decision Process

Definition:
- Set of states: S
- Set of actions (i.e., decisions): A
- Transition model: Pr(s_t | s_{t-1}, a_{t-1})
- Reward model (i.e., utility): R(s_t, a_t)
- Discount factor: 0 ≤ γ ≤ 1
- Horizon (i.e., # of time steps): h

Goal: find an optimal policy π*
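
To make the definition concrete, here is a minimal sketch of how such an MDP might be encoded with NumPy arrays. The two-state, two-action example, the array names T, R, gamma, h, and all of the numbers are illustrative assumptions, not anything specified in the slides.

```python
import numpy as np

# Hypothetical toy MDP (2 states, 2 actions), used only for illustration.
# T[a, s, s2] = Pr(s2 | s, a); each row T[a, s, :] must sum to 1.
T = np.array([
    [[0.9, 0.1],    # action 0, from state 0
     [0.2, 0.8]],   # action 0, from state 1
    [[0.5, 0.5],    # action 1, from state 0
     [0.0, 1.0]],   # action 1, from state 1
])

# R[a, s] = R(s, a): immediate reward for taking action a in state s.
R = np.array([
    [1.0, 0.0],     # action 0
    [0.0, 2.0],     # action 1
])

gamma = 0.95        # discount factor, 0 <= gamma < 1
h = 10              # finite horizon (number of decision steps)

assert np.allclose(T.sum(axis=2), 1.0), "each transition row must be a distribution"
```

The later sketches in this module assume this same array layout (T of shape |A| x |S| x |S|, R of shape |A| x |S|).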

Finite Horizon

Policy evaluation:
V_h^π(s) = Σ_{t=0}^h γ^t Σ_{s'} Pr(S_t = s' | S_0 = s, π) R(s', π_t(s'))

Recursive form (dynamic programming):
V_0^π(s) = R(s, π_0(s))
V_t^π(s) = R(s, π_t(s)) + γ Σ_{s'} Pr(s' | s, π_t(s)) V_{t-1}^π(s')
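
The recursive form translates directly into a short dynamic-programming routine. The sketch below evaluates a given non-stationary policy and assumes the illustrative T/R/gamma array layout introduced above; the function and argument names are assumptions, not course code.

```python
import numpy as np

def evaluate_finite_horizon(T, R, gamma, policy):
    """Finite-horizon policy evaluation via the recursion above.

    T[a, s, s2] = Pr(s2 | s, a), R[a, s] = R(s, a), and policy[t] is an
    integer array giving the action pi_t(s) for every state s.
    Returns V with V[t, s] = V_t^pi(s) for t = 0..h, where h = len(policy) - 1.
    """
    num_states = T.shape[1]
    states = np.arange(num_states)
    h = len(policy) - 1
    V = np.zeros((h + 1, num_states))

    # Base case: V_0^pi(s) = R(s, pi_0(s))
    V[0] = R[policy[0], states]

    # Recursion: V_t^pi(s) = R(s, pi_t(s)) + gamma * sum_{s'} Pr(s'|s,pi_t(s)) V_{t-1}^pi(s')
    for t in range(1, h + 1):
        a = policy[t]                                      # one action per state
        V[t] = R[a, states] + gamma * (T[a, states] @ V[t - 1])
    return V

# Example (with the toy arrays from the earlier sketch):
#   pi = [np.zeros(2, dtype=int)] * (h + 1)   # always take action 0
#   V = evaluate_finite_horizon(T, R, gamma, pi)
```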

Optimal Policy π*

Finite horizon: V_h^{π*}(s) ≥ V_h^π(s)   ∀π, ∀s

Optimal value function V* (shorthand for V^{π*}):
V_0*(s) = max_a R(s, a)
V_t*(s) = max_a [ R(s, a) + γ Σ_{s'} Pr(s' | s, a) V_{t-1}*(s') ]   (Bellman's equation)

Value Iteration Algorithm

valueIteration(MDP)
  V_0(s) ← max_a R(s, a)   ∀s
  For t = 1 to h do
    V_t(s) ← max_a [ R(s, a) + γ Σ_{s'} Pr(s' | s, a) V_{t-1}(s') ]   ∀s
  Return V

Optimal policy π*:
  t = 0: π_0(s) ← argmax_a R(s, a)   ∀s
  t > 0: π_t(s) ← argmax_a [ R(s, a) + γ Σ_{s'} Pr(s' | s, a) V_{t-1}(s') ]   ∀s

NB: π* is non-stationary (i.e., time-dependent)
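
A sketch of this algorithm in NumPy, mirroring the pseudocode with an explicit loop over actions and also returning the time-dependent greedy policy, is shown below. It assumes the illustrative T/R array layout from the earlier sketch; all names are assumptions for illustration.

```python
import numpy as np

def value_iteration_finite(T, R, gamma, h):
    """Finite-horizon value iteration, following the pseudocode above.

    Returns (V, pi) where V[t, s] = V_t(s) and pi[t, s] is the greedy
    (non-stationary) action for step t.
    """
    num_actions, num_states, _ = T.shape
    V = np.zeros((h + 1, num_states))
    pi = np.zeros((h + 1, num_states), dtype=int)

    # t = 0: V_0(s) = max_a R(s, a), pi_0(s) = argmax_a R(s, a)
    V[0] = R.max(axis=0)
    pi[0] = R.argmax(axis=0)

    # t > 0: V_t(s) = max_a [ R(s, a) + gamma * sum_{s'} Pr(s'|s,a) V_{t-1}(s') ]
    for t in range(1, h + 1):
        Q = np.empty((num_actions, num_states))
        for a in range(num_actions):
            Q[a] = R[a] + gamma * T[a] @ V[t - 1]
        V[t] = Q.max(axis=0)
        pi[t] = Q.argmax(axis=0)     # note: the optimal policy is time-dependent
    return V, pi
```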

Value Iteration: Matrix Form

R^a: |S| × 1 column vector of rewards for action a
V_t: |S| × 1 column vector of state values
T^a: |S| × |S| matrix of transition probabilities for action a

valueIteration(MDP)
  V_0 ← max_a R^a
  For t = 1 to h do
    V_t ← max_a [ R^a + γ T^a V_{t-1} ]
  Return V
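
In NumPy, the matrix form collapses the whole backup into one array expression, since T @ V applies every T^a to V at once. This vectorized variant of the earlier sketch assumes the same stacked array layout and is, again, only an illustration.

```python
import numpy as np

def value_iteration_matrix(T, R, gamma, h):
    """Matrix-form finite-horizon value iteration.

    T has shape (|A|, |S|, |S|) and R has shape (|A|, |S|), so
    R + gamma * T @ V stacks the vectors R^a + gamma * T^a V for all actions,
    and the max over axis 0 is the max over actions.
    """
    V = R.max(axis=0)                            # V_0 = max_a R^a
    for _ in range(h):
        V = (R + gamma * T @ V).max(axis=0)      # V_t = max_a [ R^a + gamma T^a V_{t-1} ]
    return V
```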

Infinite Horizon

Let h → ∞. Then V_h^π → V_∞^π and V_{h-1}^π → V_∞^π.

Policy evaluation:
V^π(s) = R(s, π(s)) + γ Σ_{s'} Pr(s' | s, π(s)) V^π(s')

Bellman's equation:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} Pr(s' | s, a) V*(s') ]

Policy Evaluation

Linear system of equations:
V^π(s) = R(s, π(s)) + γ Σ_{s'} Pr(s' | s, π(s)) V^π(s')

Matrix form:
R: |S| × 1 column vector of state rewards for π
V: |S| × 1 column vector of state values for π
T: |S| × |S| matrix of transition probabilities for π

V = R + γ T V

Solving Linear Equations

Linear system: V = R + γ T V

Gaussian elimination: solve (I - γT) V = R
Compute inverse: V = (I - γT)^{-1} R

Iterative methods:
Value iteration (a.k.a. Richardson iteration): repeat V ← R + γ T V
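
Both solution routes are easy to sketch with NumPy for a fixed policy π: solve the linear system directly, or apply the Richardson iteration. Here T_pi[s, s'] = Pr(s' | s, π(s)) and R_pi[s] = R(s, π(s)); the function names and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def evaluate_policy_direct(T_pi, R_pi, gamma):
    """Solve (I - gamma * T_pi) V = R_pi by Gaussian elimination."""
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def evaluate_policy_richardson(T_pi, R_pi, gamma, num_iters=1000):
    """Richardson (value) iteration: repeat V <- R_pi + gamma * T_pi @ V."""
    V = np.zeros_like(R_pi)
    for _ in range(num_iters):
        V = R_pi + gamma * T_pi @ V
    return V

# With the toy arrays from the first sketch and a policy given as one action per
# state (e.g. pi = np.array([0, 1])), T_pi and R_pi can be built as:
#   states = np.arange(T.shape[1]); T_pi = T[pi, states]; R_pi = R[pi, states]
```

The direct solve costs O(|S|^3) but is exact; the iterative version costs O(|S|^2) per sweep and converges geometrically at rate γ, which is the contraction property shown next.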

Contraction

Let H(V) ≐ R + γ T V be the policy evaluation operator.

Lemma 1: H is a contraction mapping:
‖H(V) - H(Ṽ)‖_∞ ≤ γ ‖V - Ṽ‖_∞

Proof:
‖H(V) - H(Ṽ)‖_∞ = ‖R + γTV - R - γTṼ‖_∞   (by definition)
= ‖γT(V - Ṽ)‖_∞   (simplification)
≤ γ ‖T‖_∞ ‖V - Ṽ‖_∞   (since ‖AB‖ ≤ ‖A‖ ‖B‖)
= γ ‖V - Ṽ‖_∞   (since max_s Σ_{s'} T(s, s') = 1)

Convergence

Theorem 2: Policy evaluation converges to V^π for any initial estimate V:
lim_{n→∞} H^(n)(V) = V^π   ∀V

Proof:
By definition, V^π = H^(∞)(0), while policy evaluation computes H^(∞)(V) for an arbitrary initial V.
By Lemma 1, ‖H^(n)(V) - H^(n)(Ṽ)‖_∞ ≤ γ^n ‖V - Ṽ‖_∞.
Hence, as n → ∞, ‖H^(n)(V) - H^(n)(0)‖_∞ → 0, and so H^(∞)(V) = V^π for all V.

Approximate Policy Evaluation

In practice, we cannot perform an infinite number of iterations.
Suppose we run the iteration for k steps and ‖H^(k)(V) - H^(k-1)(V)‖_∞ = ε; how far is H^(k)(V) from V^π?

Approximate Policy Evaluation

Theorem 3: If ‖H^(k)(V) - H^(k-1)(V)‖_∞ ≤ ε, then ‖V^π - H^(k)(V)‖_∞ ≤ ε / (1 - γ).

Proof:
‖V^π - H^(k)(V)‖_∞ = ‖H^(∞)(V) - H^(k)(V)‖_∞   (by Theorem 2)
= ‖Σ_{t=1}^∞ [H^(t+k)(V) - H^(t+k-1)(V)]‖_∞   (telescoping sum)
≤ Σ_{t=1}^∞ ‖H^(t+k)(V) - H^(t+k-1)(V)‖_∞   (since ‖A + B‖ ≤ ‖A‖ + ‖B‖)
≤ Σ_{t=1}^∞ γ^t ε   (by Lemma 1)
= ε γ / (1 - γ) ≤ ε / (1 - γ)
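
A quick numerical sanity check of this bound is sketched below on a hypothetical random row-stochastic matrix (nothing from the slides): iterate H until the residual drops to ε, then compare the true max-norm error against ε / (1 - γ).

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, eps = 5, 0.9, 1e-6

# Hypothetical policy-evaluation problem: random transition matrix and rewards.
T = rng.random((n, n)); T /= T.sum(axis=1, keepdims=True)   # rows sum to 1
R = rng.random(n)

V_exact = np.linalg.solve(np.eye(n) - gamma * T, R)         # true V^pi

V = np.zeros(n)
while True:
    V_next = R + gamma * T @ V
    if np.max(np.abs(V_next - V)) <= eps:                   # ||H^k V - H^{k-1} V|| <= eps
        V = V_next
        break
    V = V_next

print("true error:", np.max(np.abs(V - V_exact)))           # should be <= bound
print("bound     :", eps / (1 - gamma))
```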

Optimal Value Function

Non-linear system of equations:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} Pr(s' | s, a) V*(s') ]

Matrix form:
R^a: |S| × 1 column vector of rewards for action a
V*: |S| × 1 column vector of optimal values
T^a: |S| × |S| matrix of transition probabilities for action a

V* = max_a [ R^a + γ T^a V* ]

Contraction

Let H*(V) ≐ max_a [ R^a + γ T^a V ] be the operator in value iteration.

Lemma 3: H* is a contraction mapping:
‖H*(V) - H*(Ṽ)‖_∞ ≤ γ ‖V - Ṽ‖_∞

Proof: Without loss of generality, let H*(V)(s) ≥ H*(Ṽ)(s), and let
a*_s = argmax_a [ R(s, a) + γ Σ_{s'} Pr(s' | s, a) V(s') ]

Contraction (proof continued)

Then
0 ≤ H*(V)(s) - H*(Ṽ)(s)   (by assumption)
≤ R(s, a*_s) + γ Σ_{s'} Pr(s' | s, a*_s) V(s') - R(s, a*_s) - γ Σ_{s'} Pr(s' | s, a*_s) Ṽ(s')   (by definition of a*_s)
= γ Σ_{s'} Pr(s' | s, a*_s) [V(s') - Ṽ(s')]
≤ γ Σ_{s'} Pr(s' | s, a*_s) ‖V - Ṽ‖_∞   (max-norm upper bound)
= γ ‖V - Ṽ‖_∞   (since Σ_{s'} Pr(s' | s, a*_s) = 1)

Repeat the same argument for H*(Ṽ)(s) ≥ H*(V)(s), and for each s.

Convergence

Theorem 4: Value iteration converges to V* for any initial estimate V:
lim_{n→∞} H*^(n)(V) = V*   ∀V

Proof:
By definition, V* = H*^(∞)(0), while value iteration computes H*^(∞)(V) for some initial V.
By Lemma 3, ‖H*^(n)(V) - H*^(n)(Ṽ)‖_∞ ≤ γ^n ‖V - Ṽ‖_∞.
Hence, as n → ∞, ‖H*^(n)(V) - H*^(n)(0)‖_∞ → 0, and so H*^(∞)(V) = V* for all V.

Value Iteration

Even when the horizon is infinite, we perform only finitely many iterations.
Stop when ‖V_n - V_{n-1}‖_∞ ≤ ε.

valueIteration(MDP)
  V_0 ← max_a R^a ; n ← 0
  Repeat
    n ← n + 1
    V_n ← max_a [ R^a + γ T^a V_{n-1} ]
  Until ‖V_n - V_{n-1}‖_∞ ≤ ε
  Return V_n
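
The same stopping rule in NumPy, written against the illustrative stacked T/R arrays used throughout (an assumed layout, not course code):

```python
import numpy as np

def value_iteration(T, R, gamma, eps):
    """Infinite-horizon value iteration with the stopping rule above.

    T[a, s, s2] = Pr(s2 | s, a), R[a, s] = R(s, a), 0 <= gamma < 1.
    Stops when ||V_n - V_{n-1}||_inf <= eps and returns V_n.
    """
    V = R.max(axis=0)                                # V_0 = max_a R^a
    while True:
        V_next = (R + gamma * T @ V).max(axis=0)     # V_n = max_a [ R^a + gamma T^a V_{n-1} ]
        if np.max(np.abs(V_next - V)) <= eps:
            return V_next
        V = V_next
```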

Induced Policy

Since ‖V_n - V_{n-1}‖_∞ ≤ ε, by Theorem 4 we know that ‖V_n - V*‖_∞ ≤ ε / (1 - γ).

But how good is the stationary policy π_n extracted from V_n?
π_n(s) = argmax_a [ R(s, a) + γ Σ_{s'} Pr(s' | s, a) V_n(s') ]

How far is V^{π_n} from V*?
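
Extracting the induced stationary policy from V_n is one greedy argmax per state; here is a sketch under the same assumed array layout.

```python
import numpy as np

def extract_policy(T, R, gamma, V):
    """Greedy stationary policy induced by a value estimate V_n:
    pi_n(s) = argmax_a [ R(s, a) + gamma * sum_{s'} Pr(s'|s,a) V_n(s') ]."""
    Q = R + gamma * T @ V          # Q[a, s]
    return Q.argmax(axis=0)        # one action index per state

# Example: pi_n = extract_policy(T, R, gamma, value_iteration(T, R, gamma, eps=1e-6))
```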

Induced Policy

Theorem 5: ‖V^{π_n} - V*‖_∞ ≤ 2ε / (1 - γ)

Proof:
‖V^{π_n} - V*‖_∞ = ‖V^{π_n} - V_n + V_n - V*‖_∞
≤ ‖V^{π_n} - V_n‖_∞ + ‖V_n - V*‖_∞   (since ‖A + B‖ ≤ ‖A‖ + ‖B‖)
≤ ε / (1 - γ) + ε / (1 - γ)   (by Theorems 2 and 4, noting that H^{π_n}(V_n) = H*(V_n) since π_n is greedy with respect to V_n)
= 2ε / (1 - γ)

Summary

Value iteration:
- Simple dynamic programming algorithm
- Complexity: O(n |A| |S|^2), where n is the number of iterations

Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?
Yes: by policy iteration.