ARTIFICIAL INTELLIGENCE. Markov decision processes

INFOB2KI 2017-2018, Utrecht University, The Netherlands. ARTIFICIAL INTELLIGENCE: Markov decision processes. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

PageRank (Google). PageRank can be understood as: a) A Markov Chain b) A Markov Decision Process c) A Partially Observable Markov Decision Process d) None of the above

Markov model. A Markov model is a stochastic model that assumes the Markov property. Stochastic model: models a process where the state depends on previous states in a non-deterministic way. Markov property: the probability distribution of future states, conditioned on both past and present values, depends only upon the present state: given the present, the future does not depend on the past. Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable.

Markov model types. For prediction, a fully observable process is modelled as a Markov chain and a partially observable one as a Hidden Markov model; for planning (typically for optimisation purposes), the fully observable case is an MDP (Markov decision process) and the partially observable case a POMDP (Partially observable Markov decision process). Prediction models can be represented at variable level by a (Dynamic) Bayesian network: state nodes S1, S2, S3 with observation nodes O1, O2, O3.

MDP: outline. Search in non-deterministic environments. Solution: an optimal policy (plan) of actions that maximizes reward (decision-theoretic planning). Bellman equations and value iteration. Link with learning.

Running example: Grid World. A maze-like problem. The agent lives in a grid, where walls block the agent's path. Noisy movement: actions do not always go as planned. If there is a wall in the chosen direction, the agent stays put; 80% of the time, the action North takes the agent North; 10% of the time, North takes the agent West; 10% East (same deviations for the other actions). The agent receives rewards each time step: a small living reward each step (can be negative); big rewards come at the end (good or bad). Goal: maximize the sum of rewards.

Grid World Actions: Deterministic Grid World vs. Stochastic Grid World.

Goals, rewards and optimality criteria. Traditional planning goals can be encoded in the reward function; the effect of transitions is uncertain. Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state, and assigning all other states negative rewards. Rewards are additive and time-separable, and the objective is to maximize expected total reward; future rewards may be discounted. The planning horizon can be finite, infinite or indefinite (a special case of infinite: guaranteed to reach a terminal state).

Markov Decision Processes. MDPs are non-deterministic search problems. An MDP is defined by: a set of states s ∈ S; a set of actions a ∈ A; a transition function T(s, a, s'), the probability that a from s leads to s', i.e., P(s' | s, a), also called the model or the dynamics; a reward function R(s, a, s'), sometimes just R(s) or R(s'); a start state; sometimes a terminal state.
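To make this (S, A, T, R) definition concrete, here is a minimal Python sketch of the Grid World MDP used as the running example. The 4x3 layout, the wall cell, the +1/-1 exit cells and the 80/10/10 noise model are assumptions read off the slides' pictures, not code from the course; terminal cells get a single exit action leading to an absorbing DONE state, as on the k=1 slide later on.

```python
# Minimal Grid World MDP sketch (assumed 4x3 layout, 80/10/10 noise, +1/-1 exits).
COLS, ROWS = 4, 3
WALLS = {(1, 1)}                              # blocked cell (assumed position)
EXIT_REWARD = {(3, 2): +1.0, (3, 1): -1.0}    # terminal cells (assumed positions)
ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
NOISE = 0.2                                   # 80% intended move, 10% each side
LIVING_REWARD = 0.0                           # R(s, a, s') for ordinary moves
DONE = 'DONE'                                 # absorbing state entered after exiting

STATES = [(c, r) for c in range(COLS) for r in range(ROWS) if (c, r) not in WALLS]
STATES.append(DONE)

def actions(s):
    """Available actions: exit-only in terminal cells, nothing once DONE."""
    if s == DONE:
        return []
    return ['X'] if s in EXIT_REWARD else list(ACTIONS)

def _move(s, d):
    """Apply direction d; stay put when hitting a wall or the grid boundary."""
    c, r = s[0] + d[0], s[1] + d[1]
    return (c, r) if 0 <= c < COLS and 0 <= r < ROWS and (c, r) not in WALLS else s

def transitions(s, a):
    """T(s, a, .) as a list of (next state, probability, reward) triples."""
    if s == DONE:
        return []
    if s in EXIT_REWARD:                      # exit action collects the big reward
        return [(DONE, 1.0, EXIT_REWARD[s])]
    side = {'N': 'WE', 'S': 'WE', 'E': 'NS', 'W': 'NS'}[a]
    out = [(_move(s, ACTIONS[a]), 1 - NOISE, LIVING_REWARD)]
    out += [(_move(s, ACTIONS[p]), NOISE / 2, LIVING_REWARD) for p in side]
    return out
```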

What is Markov about MDPs? Recall: Markov generally means that given the present state, the future and the past are independent. For Markov decision processes, Markov means action outcomes depend only on the current state. Andrey Markov (1856-1922). This is just like search, where the successor function could only depend on the current state (not the history).

MDP Search Trees. Each MDP state s projects a search tree: s is a state, (s, a) is a q-state, and (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s').

Policies. In deterministic search problems, we wanted an optimal plan: a sequence of actions, from start to a goal. For MDPs, we want an optimal policy π*: S → A. Example: an optimal policy when R(s, a, s') = -0.03 for all non-terminal states s. A policy π gives an action for each state. An optimal policy is one that maximizes expected utility (reward) if followed. Note: an explicit policy defines a reflex agent.

Optimal Policies - example: policies for living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4 and R(s) = -2.0.

Utilities of Reward Sequences. What preferences should an agent have over reward sequences? More or less: [1, 2, 2] or [2, 3, 4]? Now or later: [0, 0, 1] or [1, 0, 0]? It is reasonable to maximize the sum of rewards. It is also reasonable to prefer rewards now to rewards later. A solution: the value of rewards decays exponentially.

Discounting: worth now (1), worth next step (γ), worth in two steps (γ²).

Returns in the long run. Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. The return gives the total reward from time t to time T, the end of an episode: R_t = r_{t+1} + r_{t+2} + ... + r_T. Continuing tasks: interaction does not have natural episodes; use the discounted return R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}, where γ, 0 ≤ γ ≤ 1, is the discount rate (γ near 0: shortsighted; γ near 1: farsighted).

Discounting: implementation. How to discount? Each time we descend a level, we multiply in the discount once. Why discount? Sooner rewards probably do have higher utility than later rewards, and it also helps our algorithms converge. Example: the value of receiving [1, 2, 3] with a discount of 0.5 is 1·1 + 0.5·2 + 0.25·3 = 2.75, which is less than that of [3, 2, 1] (3 + 0.5·2 + 0.25·1 = 4.25).
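A one-line helper makes this arithmetic reproducible; the function name is chosen here for illustration and is not part of the lecture.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^k * r_{k+1} over a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 2, 3], 0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_return([3, 2, 1], 0.5))  # 3 + 0.5*2 + 0.25*1 = 4.25
```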

Solving MDPs

Optimal Quantities. A state s has value V(s): V*(s) = the expected return starting in s and thereafter acting optimally. An intermediate q-state (s, a) has value Q*(s, a): the expected return after taking action a in state s and thereafter acting optimally. The optimal policy: π*(s) = the optimal action from state s. Any policy that is greedy with respect to V* is an optimal policy.

Characterizing V^π(s). Consider an arbitrary policy π prescribing actions in states. What is the value of following this policy when in state s? First, let's consider the deterministic situation: V^π(s_t) = E[R_t] = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + ...) = R(s_t, π(s_t), s_{t+1}) + γ V^π(s_{t+1}). Noise: take the expected value over all possible next states: V^π(s) = Σ_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ], where P(s' | s, π(s)) is given by the transition function T(s, π(s), s').
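A minimal sketch of computing V^π by iterating this equation until the values stop changing (iterative policy evaluation). It assumes the transitions(s, a) interface from the Grid World sketch above, yielding (next state, probability, reward) triples; it is illustrative, not the course's own code.

```python
def policy_evaluation(states, transitions, policy, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- sum_s' P(s'|s,pi(s)) * [R(s,pi(s),s') + gamma*V(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy.get(s)
            if a is None:                      # absorbing state: no action, value 0
                continue
            v = sum(p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```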

Example: Policy Evaluation. π: Always Go Right vs. π: Always Go Forward. V^π is shown for each state (indicated in the cells).

Characterizing the optimal V*(s). The expected return from state s is maximized by acting optimally in s and thereafter; the optimal value for a state is obtained when following the optimal policy π*: V*(s) = max_π V^π(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ] = max_a Q*(s, a). This equation is called the Bellman equation.

Using V*(s) to obtain π*(s). The optimal policy can be extracted from V*(s): π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ], using one-step lookahead, i.e. use the Bellman equation once more to compute the given summation for all actions, but rather than returning the max value, return the action that gives the max value.
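The same one-step lookahead written as a sketch, again assuming (next state, probability, reward) transition triples; q_value and extract_policy are illustrative names, not the course's API.

```python
def q_value(s, a, V, transitions, gamma=0.9):
    """Q(s, a) = sum_s' T(s, a, s') * [R(s, a, s') + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a))

def extract_policy(states, actions, transitions, V, gamma=0.9):
    """pi*(s) = argmax_a Q(s, a), computed from V* by one-step lookahead."""
    return {s: max(actions(s), key=lambda a: q_value(s, a, V, transitions, gamma))
            for s in states if actions(s)}
```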

Using V*(s) to obtain π*(s). Back to Grid World: Noise = 0.2 (i.e. moves succeed with p = 0.8; deviations to left/right each with p = 0.1), Discount γ = 0.9, Living reward R(s, a, s') = 0. Optimal policy? Given V* (shown in the cells), one-step lookahead produces the long-term optimal action (shown as a small arrowhead).

Value Iteration. A Dynamic Programming algorithm for computing V*.

Value Iteration (VI). Tree backup: define V_k(s) as the optimal value of s still to be obtained if the game ends in k more time steps. Start with V_0(s) = 0 for all s (including terminal s); terminal state rewards (if any) are added at k = 1. Given the V_k(s') values, compute for each s: V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ] = max_a Q_{k+1}(s, a). Repeat until convergence of the V values. Theorem: VI will converge to unique optimal values. Basic idea: the approximations get refined towards the optimal values.
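A compact sketch of this loop under the same assumed MDP interface. It stops when the largest value change falls below a tolerance, which is one common convergence test rather than the slides' fixed number of sweeps.

```python
def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Repeat V_{k+1}(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma * V_k(s')]."""
    V = {s: 0.0 for s in states}               # V_0(s) = 0 for all s
    while True:
        V_new = {
            s: max((sum(p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a))
                    for a in actions(s)), default=0.0)   # absorbing states stay at 0
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```

Running this on the Grid World sketch with γ = 0.9 and then extracting the greedy policy with one-step lookahead reproduces the kind of value maps shown on the k = 1 to k = 100 slides.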

VI init: k=0. Policy: based on one-step lookahead, i.e. the action that gives the max V_{k+1} value; not used in computing the V values! Not yet interesting (only shown to demonstrate change); default policy: N. Noise = 0.2, Discount = 0.9, Living reward (R) = 0.

k=1. Implementation of a terminal state s_e with reward r: one action x (exit), with T(s_e, x, s') = 1 and R(s_e, x, s') = r. At k=1 the terminal states get their associated reward, and there is no change in their V value after that. Noise = 0.2, Discount = 0.9, Living reward (R) = 0.

k = 2 through k = 12, and k = 100: successive value-iteration updates shown on the Grid World (Noise = 0.2, Discount = 0.9, Living reward = 0).

Problems with Value Iteration. Value iteration repeats the Bellman update: V_{k+1}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]. Problem 1: it is slow, O(S²A) per iteration. Problem 2: the max at each state rarely changes. Problem 3: the policy often converges long before the values.

Policy Iteration. An alternative approach for optimal values: Step 1: policy evaluation: calculate returns for some fixed policy until convergence. Step 2: policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) returns as future values. Repeat the steps until the policy converges. This is policy iteration. It is still optimal! It can converge (much) faster under some conditions.
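A sketch of the two alternating steps under the same assumed interface: the inner loop is policy evaluation for the current fixed policy, the outer step is greedy one-step-lookahead improvement, starting from an arbitrary initial policy.

```python
def policy_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: actions(s)[0] for s in states if actions(s)}   # arbitrary start
    while True:
        # Step 1: policy evaluation for the fixed policy.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s, a in policy.items():
                v = sum(p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a))
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < tol:
                break
        # Step 2: policy improvement via one-step lookahead on the converged V.
        stable = True
        for s in policy:
            best = max(actions(s), key=lambda a: sum(
                p * (r + gamma * V[s2]) for s2, p, r in transitions(s, a)))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```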

Recall: Policy Evaluation. π: Always Go Right vs. π: Always Go Forward. V^π is shown for each state.

Comparison of VI and PI. Both are dynamic programs for solving MDPs and compute the same thing (all optimal values). In value iteration: every iteration updates both the values and (implicitly) the policy; we don't track the policy, since taking the max over actions implicitly recomputes it. In policy iteration: we do several passes that update returns with a fixed policy (each pass is fast: we consider only one action, not all); after the policy is evaluated, a new policy is chosen (slow, like a value iteration pass); the new policy will be better (or we're done).

Double Bandits

Double-Bandit MDP. Actions: Blue, Red. States: Win (W), Lose (L). No discount; 100 time steps; both states have the same value. From either state, Red pays $2 with probability 0.75 and $0 with probability 0.25; Blue pays $1 with probability 1.0. Note the representation at value level rather than at variable level!

Offline Planning. Solving MDPs is offline planning: you determine all quantities through computation, you need to know the details of the MDP, and you do not actually play the game! With no discount and 100 time steps (both states have the same value): Value of Play Red = 150, Value of Play Blue = 100.
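Because the MDP is fully known here, the slide's values follow from a short expected-value calculation; this sketch just reproduces that arithmetic.

```python
STEPS = 100                          # no discount, 100 time steps
ev_blue = 1.0                        # Blue always pays $1
ev_red = 0.75 * 2 + 0.25 * 0         # Red pays $2 with p = 0.75, $0 otherwise
print("Value of Play Blue:", STEPS * ev_blue)   # 100.0
print("Value of Play Red: ", STEPS * ev_red)    # 150.0
```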

Let's Play! $2 $2 $0 $2 $2 $2 $2 $0 $0 $0

Online Planning. The rules changed! Red's win chance is different: the probabilities of winning $2 or $0 with Red are now unknown, while Blue still pays $1 with probability 1.0.

Let's Play again! $0 $0 $0 $2 $0 $2 $0 $0 $0 $0

What Just Happened? That wasn't planning, it was learning! Specifically, reinforcement learning. There was an MDP, but you couldn't solve it with just computation; you needed to actually act to figure it out.

PageRank (Google). PageRank can be understood as: a) A Markov Chain b) A Markov Decision Process c) A Partially Observable Markov Decision Process d) None of the above