Reinforcement Learning. Markov Decision Processes

Reinforcement Learning: Markov Decision Processes
Manfred Huber 2014, slide 1

Sequential Decision Making
N-armed bandit problems are not a good way to model sequential decision problems: they only deal with static decision sequences. This could be mitigated by adding states (which would further increase the number of samples needed). We need a better model for sequential decision tasks.
Manfred Huber 2014, slide 2

Markov Decision Processes
Markov Decision Processes (MDPs) are a more comprehensive model. They introduce the concept of a state to describe the internals of an executed sequence, allow for conditional action sequences, and model the underlying process as a probabilistic sequence of states with associated rewards.
Manfred Huber 2014, slide 3

Sequential Decision Making
[Figure: the agent-environment interface (Figure 3.1 from Sutton and Barto): the agent selects actions, the environment answers with a reward and a new state, at discrete time steps t = 0, 1, 2, 3, ...]
To address sequential decisions with conditional action choices we need state:
The agent observes the state at step t: s_t in S
The agent produces an action at step t: a_t in A(s_t)
The agent gets the resulting reward: r_{t+1}
The agent finds itself in the resulting next state: s_{t+1}
The state represents the information about the current world/agent configuration and can be different from the observable information. The agent implements a mapping from states to probabilities of selecting each possible action; this mapping is called the agent's policy pi_t, where pi_t(s, a) is the probability that a_t = a if s_t = s.
Manfred Huber 2014, slide 4
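
A minimal sketch of this interaction loop in Python; the `env.reset`/`env.step` interface and the `policy` callable are illustrative assumptions, not part of the slides:

```python
def run_episode(env, policy, max_steps=100):
    """Roll out one episode of the agent-environment loop sketched above.

    At each step t the agent observes state s_t, draws a_t from its policy,
    and the environment answers with reward r_{t+1} and next state s_{t+1}.
    """
    s = env.reset()                    # s_0
    trajectory = []                    # list of (s_t, a_t, r_{t+1})
    for t in range(max_steps):
        a = policy(s)                  # a_t in A(s_t)
        s_next, r, done = env.step(a)  # environment transition
        trajectory.append((s, a, r))
        s = s_next
        if done:                       # episode terminated
            break
    return trajectory
```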

Sequential Decision Making
Executions can be represented as state/action/reward sequences: omega = s_0, a_0, s_1, a_1, ...
To model such systems we need to know how states result from actions and how rewards relate to states and actions.
Markov Models are a strong and powerful yet flexible framework to model the dynamics of such systems.
Markov Assumption: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_1) = P(s_{t+1} | s_t, a_t)
Manfred Huber 2014, slide 5

Markov Decision Problems
Fully Observable Markov Decision Problems can be formulated as a tuple (A, S, T, R):
Action set A = {a_i}
State set S = {s_j}
State transition probabilities T: P(s_i | s_j, a)
State-dependent expected rewards R: r(s, a)
Rewards can be probabilistic: P(r | s, a)
Markov assumption with reward: P(r_t, s_t | s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ..., s_1) = P(r_t, s_t | s_{t-1}, a_{t-1})
Manfred Huber 2014, slide 6
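
As a concrete illustration, the tuple (A, S, T, R) can be written out directly as dictionaries; this is a sketch with made-up states, actions, and numbers, not an example from the lecture:

```python
# A tiny MDP (A, S, T, R) written out explicitly.
A = ["stay", "go"]                    # action set {a_i}
S = ["s0", "s1"]                      # state set {s_j}

# T[(s, a)] maps each successor state s' to P(s' | s, a)
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

# R[(s, a)] is the expected reward r(s, a)
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"):  -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "go"):  -1.0,
}
```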

Markov Decision Processes
[Figure: an example MDP drawn as a graph over states s_0 through s_7, with actions, transition probabilities, and rewards labeling the edges.]
Manfred Huber 2014, slide 7

Markov Decision Processes
Markov Decision Processes are Markov models with multiple choices for the transition probabilities in each state.
A Markov Chain is the special case where there is only one action choice in each state (and thus no decisions are required).
An MDP model can be considered a representation of all Markov chains for all possible policies, as the sketch below illustrates.
Manfred Huber 2014, slide 8
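
The last point can be made concrete in code: fixing a policy collapses the MDP's transition model into the transition probabilities of a single Markov chain. A sketch, reusing the illustrative `S`, `A`, `T` dictionaries from the example after slide 6:

```python
def induced_chain(S, A, T, policy):
    """Transition probabilities P_pi(s' | s) of the Markov chain induced by a policy.

    policy(s, a) returns pi(a | s); a deterministic policy returns 1.0 for the
    chosen action in s and 0.0 for every other action.
    """
    P_pi = {s: {s2: 0.0 for s2 in S} for s in S}
    for s in S:
        for a in A:
            for s2, p in T.get((s, a), {}).items():
                P_pi[s][s2] += policy(s, a) * p
    return P_pi

# Deterministic policy that always chooses "go"
always_go = lambda s, a: 1.0 if a == "go" else 0.0
print(induced_chain(S, A, T, always_go))
```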

Markov Decision Processes
Markov Decision Processes represent decision making on a Markov model.
Fully observable: the current state is known (MDP).
Partially observable: the current state can only be observed indirectly (POMDP).
In Markov Decision Processes decisions can be represented as policies.
Fully observable: pi(s, a) = P(a | s)
Deterministic policies: pi(s) = a
Manfred Huber 2014, slide 9

Markov Decision Processes
Markov Decision Processes are a very general modeling framework.
Most systems have a representation in which the Markov property holds: the state only has to contain what makes it Markov, and Markov-n systems have an equivalent Markov model.
Many systems can be modeled as fully observable. Fully observable does not mean that everything is known (only the state is).
Manfred Huber 2014, slide 10

Markov Decision Processes
Markov Decision Processes can have terminating states. Termination can be equivalently modeled by introducing a terminal state (which is itself non-terminating) which loops to itself with reward 0. Every state that would terminate links to this state for every action with probability 1; the reward of this transition is the reward of the terminating state.
[Figure: a terminating state s linking with P = 1 to the terminal state s_T, which loops to itself with reward r = 0.]
Manfred Huber 2014, slide 11
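
A sketch of that construction on the same dictionary representation as the earlier examples; the state name "TERM" and the `terminating`/`terminal_reward` arguments are illustrative:

```python
def add_absorbing_terminal(S, A, T, R, terminating, terminal_reward):
    """Replace termination with an explicit absorbing state 'TERM'.

    terminating: the set of states that would end the episode
    terminal_reward: maps each terminating state to its terminal reward
    """
    S = S + ["TERM"]
    for s in terminating:
        for a in A:
            T[(s, a)] = {"TERM": 1.0}        # link to the terminal state with probability 1
            R[(s, a)] = terminal_reward[s]   # reward of the terminating state on this transition
    for a in A:
        T[("TERM", a)] = {"TERM": 1.0}       # the terminal state loops to itself ...
        R[("TERM", a)] = 0.0                 # ... with reward 0
    return S, T, R
```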

Markov Decision Processes
For mathematical analysis we can simplify MDPs into equivalent MDPs where:
State transition probabilities are conditionally independent of the reward probabilities: T: P(s_i | s_j, a)
Reward probabilities only depend on the state, and R is deterministic: R(s) = sum_r P(r | s) r
For practical systems, R is often modeled based on s and a to reduce the state space size.
Manfred Huber 2014, slide 12
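
The deterministic reward R(s) = sum_r P(r | s) r is just the expectation of the reward distribution in each state; a short sketch, assuming a hypothetical `reward_dist[s]` holding (r, P(r | s)) pairs:

```python
def expected_reward(reward_dist, s):
    """Deterministic reward R(s) = sum over r of P(r | s) * r."""
    return sum(p * r for r, p in reward_dist[s])

# Example: in state "s1" the reward is +1 with probability 0.9 and 0 with probability 0.1
reward_dist = {"s1": [(1.0, 0.9), (0.0, 0.1)]}
print(expected_reward(reward_dist, "s1"))   # -> 0.9
```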

Designing MDPs
The most important part is to design an appropriate state and action space.
States do not have to represent every aspect; only the Markov property has to hold.
Actions can be low level or high level and do not need to take equal amounts of time to execute; MDPs can be event-driven.
Abstract representations result in a smaller MDP: usually faster to learn, with better generalization.
Manfred Huber 2014, slide 13

Designing MDPs
Tasks for agents can usually be characterized by goals and objectives.
Goals in AI usually refer to conditions that have to be met in order to achieve the task; goals can be represented by a set of states.
Objectives refer to properties that have to be optimized but might not have definite outcomes; objectives are natively characterized by utilities.
In MDPs all tasks have to be represented in terms of a scalar reward function.
Manfred Huber 2014, slide 14

Designing MDPs
Goals can be mapped into a reward function, e.g. a positive reward in each goal state and no reward in other states.
Objectives can be mapped into reward, e.g. assign to each state the incremental change in outcome utility if terminating in this state.
Goals and objectives can be mixed. In the resulting system the goal might no longer be reached due to the influence of the objective.
Manfred Huber 2014, slide 15

Designing MDPs
Tasks in MDPs are defined by multiple properties:
Reward function
Utility/return definition
Discount factor for discounted future reward utilities
Changing one of these properties potentially changes the task to be solved and thus the learned policy.
Manfred Huber 2014, slide 16

Designing MDPs
Example system: a mobile robot moving on a 3x3 grid with fixed obstacles. Moves can only be horizontal and vertical by one cell. The robot uses energy for each move but can turn itself off.
Task: reach the goal location while minimizing energy usage.
Manfred Huber 2014, slide 17

Designing MDPs
States: robot X and Y coordinates. No need to include the obstacle locations since they are fixed.
Actions: Left, Right, Up, Down, Off.
Reward:
Goal reaching: R_g(s) = +r_g at the goal, 0 otherwise
Energy: R_e(a) = -r_e for L, R, U, D and 0 for Off
Total reward: R(s, a) = R_g(s) + R_e(a)
Manfred Huber 2014, slide 18
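
A sketch of this reward decomposition for the grid robot; the goal cell and the concrete values of r_g and r_e are placeholders, since the slides leave them as parameters:

```python
GOAL = (3, 3)                      # goal cell used on the following slides
MOVES = ("L", "R", "U", "D")       # energy-consuming actions; "Off" is free

def reward(s, a, r_g=10.0, r_e=1.0):
    """Total reward R(s, a) = R_g(s) + R_e(a) for the grid robot."""
    goal_part = r_g if s == GOAL else 0.0        # R_g(s): +r_g at the goal, 0 otherwise
    energy_part = -r_e if a in MOVES else 0.0    # R_e(a): -r_e for L, R, U, D, 0 for Off
    return goal_part + energy_part
```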

Designing MDPs
Transition probabilities: the transition probabilities encode how the actions work. For deterministic actions the probabilities are 1 and 0.
Utility choice: discounted sum of future rewards.
Note: which task will be solved (and whether the agent attempts to reach the goal) depends on r_g, r_e, and the discount factor.
Manfred Huber 2014, slide 19
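
For deterministic actions the transition function can simply return the single successor that gets probability 1. A sketch for the grid robot, under the assumptions that a move into a wall or obstacle leaves the robot where it is and that "Off" leads to a terminal state (the slides do not spell out these corner cases):

```python
OBSTACLES = {(1, 2), (1, 3)}       # fixed obstacle cells from the example

def next_state(s, a):
    """Deterministic successor state: P(next_state(s, a) | s, a) = 1, all others 0."""
    if a == "Off":
        return "TERM"                              # turning off ends the task (assumption)
    x, y = s
    dx, dy = {"L": (-1, 0), "R": (1, 0), "U": (0, 1), "D": (0, -1)}[a]
    nx, ny = x + dx, y + dy
    if not (1 <= nx <= 3 and 1 <= ny <= 3) or (nx, ny) in OBSTACLES:
        return s                                   # blocked move: stay put (assumption)
    return (nx, ny)
```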

Designing MDPs
Goal in (3,3), obstacles in (1,2) and (1,3).
The goal state transitions to the terminal state for every action; the reward on that transition is r_g, the reward in the terminal state's self-loop is 0, and V(s_g) = r_g.
[Figure: the 3x3 grid MDP with per-action rewards: r = -r_e for each move (L, R, U, D), r = 0 for Off, and r = r_g for every action in the goal state, which leads to the terminal state s_T.]
Manfred Huber 2014, slide 20

Designing MDPs
The policy does not ensure reaching the goal: it optimizes the tradeoff of cost and benefit (reaching the goal). The robot might turn itself off if the way to the goal is too long.
The optimal policy depends on the choice of discount factor and the choice of r_e and r_g.
Utility accumulation and discount factor are part of the definition of the task.
Manfred Huber 2014, slide 21

Discount Factor and Termination Probabilities
There are two interpretations of the discount factor:
The discount factor indicates how much less we value future rewards compared to current ones.
Value accumulation is undiscounted and the discount factor indicates the probability that the task will not be randomly terminated at each step; discounting is due to the likelihood that no future reward can be obtained.
Manfred Huber 2014, slide 22
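
The two readings agree in expectation: discounting every future reward by gamma per step gives the same expected return as leaving rewards undiscounted but ending the task after each step with probability 1 - gamma. A small Monte Carlo sketch of that equivalence on an arbitrary fixed reward sequence:

```python
import random

def discounted_return(rewards, gamma):
    """Discounted sum of rewards: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def random_termination_return(rewards, gamma, episodes=100000):
    """Undiscounted return when the task survives each step only with probability gamma."""
    total = 0.0
    for _ in range(episodes):
        for r in rewards:
            total += r
            if random.random() > gamma:    # task randomly terminated (probability 1 - gamma)
                break
    return total / episodes

rewards = [1.0, 2.0, 0.5, 3.0]
print(discounted_return(rewards, 0.9))           # exact discounted value
print(random_termination_return(rewards, 0.9))   # approaches the same value
```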

Markov Decision Processes
Markov Decision Processes provide a relatively general modeling framework for agent/environment system interaction.
States and transition probabilities model the process dynamics.
Actions model the range of possible agent choices.
Reward represents local agent feedback; together with the value accumulation model and discount factor it defines the task.
POMDPs are an extension where the state is not fully observable.
A rational agent attempts to find an optimal policy: maximize the expected accumulated value.
Manfred Huber 2014, slide 23