Reinforcement Learning

Transcription:

Reinforcement Learning
- Control learning
- Control policies that choose optimal actions
- Q learning
- Convergence
Chapter 13, Reinforcement Learning

Control Learning
Consider learning to choose actions, e.g.,
- Robot learning to dock on battery charger
- Learning to choose actions to optimize factory output
- Learning to play Backgammon
Note several problem characteristics:
- Delayed reward
- Opportunity for active exploration
- Possibility that state is only partially observable
- Possible need to learn multiple tasks with same sensors/effectors

One Example: TD-Gammon [Tesauro, 1995]
Learn to play Backgammon.
Immediate reward:
- +100 if win
- -100 if lose
- 0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to best human player.

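Reinforcement Learning Problem
[Diagram: the agent sends an action to the environment; the environment returns a state and a reward.]
$$s_0 \xrightarrow{a_0,\ r_0} s_1 \xrightarrow{a_1,\ r_1} s_2 \xrightarrow{a_2,\ r_2} \cdots$$
Goal: learn to choose actions that maximize
$$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots, \quad \text{where } 0 \le \gamma < 1$$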
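The goal above is a discounted sum, which is easy to evaluate for a finite reward sequence. The following is a minimal sketch; the function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step applies G = r + gamma * G.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then 100 on the third.
example = discounted_return([0, 0, 100], gamma=0.9)
print(example)
```

With $\gamma = 0.9$ the reward of 100 arriving two steps late is worth $0.9^2 \cdot 100$, which shows how the discount factor trades off immediate against delayed reward.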

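Markov Decision Process
Assume
- finite set of states $S$
- set of actions $A$
- at each discrete time $t$, agent observes state $s_t \in S$ and chooses action $a_t \in A$
- then receives immediate reward $r_t$
- and state changes to $s_{t+1}$
- Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  - i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
  - functions $\delta$ and $r$ may be nondeterministic
  - functions $\delta$ and $r$ not necessarily known to agent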

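Agent's Learning Task
Execute actions in environment, observe results, and
- learn action policy $\pi : S \to A$ that maximizes $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$ from any starting state in $S$
- here $0 \le \gamma < 1$ is the discount factor for future rewards
Note something new:
- target function is $\pi : S \to A$
- but we have no training examples of form $\langle s, a \rangle$
- training examples are of form $\langle \langle s, a \rangle, r \rangle$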

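Value Function
To begin, consider deterministic worlds.
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states
$$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$.
Restated, the task is to learn the optimal policy $\pi^*$:
$$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \quad (\forall s)$$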
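In a deterministic world, $V^{\pi}(s)$ can be computed by simply following the policy and summing discounted rewards. The sketch below assumes a hypothetical four-state corridor (states 0 to 3, goal at 3, reward 100 for entering the goal); the world, the function names, and the horizon cutoff are all illustrative assumptions.

```python
GOAL = 3  # hypothetical absorbing goal state in a corridor 0-1-2-3


def delta(s, a):
    """Deterministic transition: move one cell, staying inside the corridor."""
    if s == GOAL:
        return GOAL
    return max(s - 1, 0) if a == "left" else s + 1


def reward(s, a):
    """Immediate reward: 100 for the move that enters the goal, else 0."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


def v_pi(s, policy, gamma=0.9, horizon=50):
    """Discounted sum of rewards from following `policy` starting at s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward(s, a)
        discount *= gamma
        s = delta(s, a)
    return total


def always_right(s):
    return "right"


def always_left(s):
    return "left"


values_right = [v_pi(s, always_right) for s in range(4)]
values_left = [v_pi(s, always_left) for s in range(4)]
print(values_right, values_left)
```

Comparing the two lists state by state shows why "always right" is the better policy in this toy world: it has the larger value in every non-goal state.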

[Figure: a grid world with absorbing goal state G, shown in four panels: $r(s, a)$ (immediate reward) values, which are 100 for actions entering G and 0 otherwise; $Q(s, a)$ values (72, 81, 90, 100); $V^*(s)$ values (81, 90, 100); and one optimal policy.]

What to Learn
We might try to have agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$).
It could then do a lookahead search to choose best action from any state $s$ because
$$\pi^*(s) = \arg\max_{a} \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$
A problem:
- This works well if agent knows $\delta : S \times A \to S$, and $r : S \times A \to \mathbb{R}$
- But when it doesn't, it can't choose actions this way
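The lookahead rule can be sketched directly when $\delta$, $r$, and $V^*$ are all available. The corridor world and the `V_STAR` table below are hypothetical values chosen for illustration ($\gamma = 0.9$, reward 100 for entering the goal); they are not taken from the slides.

```python
GOAL = 3
ACTIONS = ["left", "right"]

# Assumed optimal values for a 4-state corridor with gamma = 0.9.
V_STAR = {0: 81.0, 1: 90.0, 2: 100.0, 3: 0.0}


def delta(s, a):
    """Known deterministic transition function (an assumption of this sketch)."""
    if s == GOAL:
        return GOAL
    return max(s - 1, 0) if a == "left" else s + 1


def reward(s, a):
    """Known reward function: 100 for entering the goal, else 0."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


def best_action(s, gamma=0.9):
    """One-step lookahead: argmax over a of r(s, a) + gamma * V*(delta(s, a))."""
    return max(ACTIONS, key=lambda a: reward(s, a) + gamma * V_STAR[delta(s, a)])


policy = {s: best_action(s) for s in range(GOAL)}
print(policy)
```

Note that `best_action` calls both `delta` and `reward`, which is exactly the dependency the slide flags as a problem when the agent does not know those functions.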

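Q Function
Define new function very similar to $V^*$:
$$Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))$$
If agent learns $Q$, it can choose optimal action even without knowing $\delta$!
$$\pi^*(s) = \arg\max_{a} \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$
$$\pi^*(s) = \arg\max_{a} Q(s, a)$$
$Q$ is the evaluation function the agent will learn.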

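Training Rule to Learn Q
Note $Q$ and $V^*$ closely related:
$$V^*(s) = \max_{a'} Q(s, a')$$
Which allows us to write $Q$ recursively as
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$
Let $\hat{Q}$ denote learner's current approximation to $Q$. Consider training rule
$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$
where $s'$ is the state resulting from applying action $a$ in state $s$.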

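Q Learning for Deterministic Worlds
For each $s, a$ initialize table entry $\hat{Q}(s, a) \leftarrow 0$
Observe current state $s$
Do forever:
- Select an action $a$ and execute it
- Receive immediate reward $r$
- Observe the new state $s'$
- Update the table entry for $\hat{Q}(s, a)$ as follows: $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
- $s \leftarrow s'$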
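The loop above can be sketched in a few lines. This version makes several assumptions that the slide does not: a hypothetical four-state corridor world, uniformly random action selection, and a fixed number of episodes restarted from random states in place of "do forever". The learner only observes `reward` and `delta` through their outputs.

```python
import random

GOAL = 3
ACTIONS = ["left", "right"]


def delta(s, a):
    """Environment transition (hidden from the learner except via outcomes)."""
    if s == GOAL:
        return GOAL
    return max(s - 1, 0) if a == "left" else s + 1


def reward(s, a):
    """Environment reward: 100 for entering the goal, else 0."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


def q_learning(episodes=500, gamma=0.9, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry to 0.
    q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(GOAL)          # random non-goal start state
        while s != GOAL:
            a = rng.choice(ACTIONS)      # pure random exploration (assumption)
            r = reward(s, a)
            s_next = delta(s, a)
            # Deterministic training rule: overwrite the old estimate.
            q[(s, a)] = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
            s = s_next
    return q


q_table = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(GOAL)}
print(greedy)
```

Random exploration is the simplest way to visit every state-action pair repeatedly; other exploration strategies would fit the same update line.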

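Updating $\hat{Q}$
[Figure: the robot R moves right from initial state $s_1$ to next state $s_2$. The $\hat{Q}$ values for the actions out of $s_2$ are 63, 81, and 100; the entry $\hat{Q}(s_1, a_{right})$ changes from 72 to 90.]
$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$
Notice if rewards non-negative, then
$$(\forall s, a, n) \quad \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$$
and
$$(\forall s, a, n) \quad 0 \le \hat{Q}_n(s, a) \le Q(s, a)$$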

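Convergence
$\hat{Q}$ converges to $Q$. Consider case of deterministic world where each $\langle s, a \rangle$ is visited infinitely often.
Proof: define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is
$$\Delta_n = \max_{s, a} \left| \hat{Q}_n(s, a) - Q(s, a) \right|$$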

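Convergence (cont)
For any table entry $\hat{Q}_n(s, a)$ updated on iteration $n+1$, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
$$\left| \hat{Q}_{n+1}(s, a) - Q(s, a) \right| = \left| \left( r + \gamma \max_{a'} \hat{Q}_n(s', a') \right) - \left( r + \gamma \max_{a'} Q(s', a') \right) \right|$$
$$= \gamma \left| \max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a') \right|$$
$$\le \gamma \max_{a'} \left| \hat{Q}_n(s', a') - Q(s', a') \right|$$
$$\le \gamma \max_{s'', a'} \left| \hat{Q}_n(s'', a') - Q(s'', a') \right|$$
$$\left| \hat{Q}_{n+1}(s, a) - Q(s, a) \right| \le \gamma \Delta_n$$
Note we used the general fact that
$$\left| \max_{a} f_1(a) - \max_{a} f_2(a) \right| \le \max_{a} \left| f_1(a) - f_2(a) \right|$$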
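The bound on a single updated entry can be chained across full intervals to finish the argument. The following is a sketch of that last step, assuming bounded rewards and finite initial table values so that $\Delta_0$ is finite; here $\Delta^{(k)}$ is notation introduced only for this sketch and denotes the maximum table error after $k$ full intervals.

$$\Delta^{(k)} \le \gamma \, \Delta^{(k-1)} \le \cdots \le \gamma^{k} \Delta_0$$

$$\lim_{k \to \infty} \gamma^{k} \Delta_0 = 0 \quad \text{since } 0 \le \gamma < 1$$

Every entry is updated at least once per full interval, and no update can raise an entry's error above $\gamma$ times the current maximum, so the largest error in the table shrinks geometrically and $\hat{Q}$ converges to $Q$.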

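Nondeterministic Case
What if reward and next state are non-deterministic?
We redefine $V$, $Q$ by taking expected values
$$V^{\pi}(s) \equiv E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \right] \equiv E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$
$$Q(s, a) \equiv E\left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$$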

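Nondeterministic Case
Q learning generalizes to nondeterministic worlds.
Alter training rule to
$$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n) \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$$
where
$$\alpha_n = \frac{1}{1 + visits_n(s, a)}$$
Can still prove convergence of $\hat{Q}$ to $Q$ [Watkins and Dayan, 1992]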
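The effect of the decaying learning rate is easiest to see on a single table entry. The sketch below assumes that `visits` counts the current visit as well (so the first update uses $\alpha = 1/2$) and that the next state is terminal, so its maximum $\hat{Q}$ is 0; the function names and the reward sequence are illustrative assumptions.

```python
def noisy_q_update(q_old, r, max_next_q, visits, gamma=0.9):
    """One nondeterministic Q update with alpha = 1 / (1 + visits)."""
    alpha = 1.0 / (1.0 + visits)
    return (1.0 - alpha) * q_old + alpha * (r + gamma * max_next_q)


def run_updates(rewards, gamma=0.9):
    """Apply the update repeatedly to one entry whose successor is terminal."""
    q, visits = 0.0, 0
    for r in rewards:
        visits += 1  # assumed to include the current visit
        q = noisy_q_update(q, r, max_next_q=0.0, visits=visits, gamma=gamma)
    return q


# Hypothetical noisy reward stream for one state-action pair.
estimate = run_updates([100, 0, 100, 0, 100, 0, 100, 0, 100])
print(estimate)
```

With this schedule the entry is a running average of the observed targets (including the initial value 0), so a single unlucky reward moves the estimate less and less as visits accumulate, instead of overwriting it as in the deterministic rule.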

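Temporal Difference Learning
Q learning: reduce discrepancy between successive Q estimates
One step time difference:
$$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$$
Why not two steps?
$$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2}, a)$$
Or $n$?
$$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n}, a)$$
Blend all of these:
$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$$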

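Temporal Difference Learning
$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$$
Equivalent expression:
$$Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$$
TD($\lambda$) algorithm uses above training rule
- Sometimes converges faster than Q learning
- Converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992)
- Tesauro's TD-Gammon uses this algorithm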
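The blended and recursive forms of $Q^{\lambda}$ can be compared numerically on one finite episode. This sketch assumes the episode ends in an absorbing terminal state with zero reward and zero $\hat{Q}$, so every $n$-step estimate that looks past the end equals the full return. The function names and the sample numbers are illustrative; `next_max_q[k]` stands for $\max_a \hat{Q}(s_{k+1}, a)$.

```python
def n_step_estimate(rewards, next_max_q, t, n, gamma):
    """Q^(n) at time t; rewards and Qhat are treated as zero past the end."""
    T = len(rewards)
    total = 0.0
    for k in range(n):
        if t + k < T:
            total += gamma ** k * rewards[t + k]
    last = t + n - 1  # index of the last reward used
    if last < T:
        total += gamma ** n * next_max_q[last]
    return total


def lambda_blend(rewards, next_max_q, t, lam, gamma):
    """(1 - lam) * sum over n of lam^(n-1) * Q^(n), summed in closed form."""
    steps_left = len(rewards) - t
    total = 0.0
    for n in range(1, steps_left):
        q_n = n_step_estimate(rewards, next_max_q, t, n, gamma)
        total += (1 - lam) * lam ** (n - 1) * q_n
    # All n >= steps_left give the full return; their weights sum to
    # lam^(steps_left - 1).
    full = n_step_estimate(rewards, next_max_q, t, steps_left, gamma)
    return total + lam ** (steps_left - 1) * full


def lambda_recursive(rewards, next_max_q, t, lam, gamma):
    """Same quantity via the recursive expression, working backwards."""
    q_next = 0.0  # Q^lambda at the terminal state
    for k in range(len(rewards) - 1, t - 1, -1):
        q_next = rewards[k] + gamma * ((1 - lam) * next_max_q[k] + lam * q_next)
    return q_next


# Hypothetical 3-step episode; the last entry of next_max_q is the terminal 0.
rewards = [0, 0, 100]
next_max_q = [50.0, 80.0, 0.0]
blend = lambda_blend(rewards, next_max_q, t=0, lam=0.5, gamma=0.9)
recursive = lambda_recursive(rewards, next_max_q, t=0, lam=0.5, gamma=0.9)
print(blend, recursive)
```

Setting `lam=0` recovers the one-step estimate $Q^{(1)}$, and `lam=1` ignores $\hat{Q}$ entirely and uses only the observed rewards.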

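Subtleties and Ongoing Research
- Replace $\hat{Q}$ table with neural network or other generalizer
- Handle case where state is only partially observable
- Design optimal exploration strategies
- Extend to continuous action, state
- Learn and use $\hat{\delta} : S \times A \to S$, where $\hat{\delta}$ is an approximation to $\delta$
- Relationship to dynamic programming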

RL Summary
Reinforcement learning (RL)
- control learning
- delayed reward
- possible that the state is only partially observable
- possible that the relationship between states/actions is unknown
Temporal Difference Learning
- learn discrepancies between successive estimates
- used in TD-Gammon
$V(s)$: state value function
- needs known reward/state transition functions

RL Summary
$Q(s, a)$: state/action value function
- related to $V$
- does not need reward/state transition functions
- training rule
  - related to dynamic programming
  - measure actual reward received for action and future value using current Q function
  - deterministic: replace existing estimate
  - nondeterministic: move table estimate towards measured estimate
- convergence: can be shown in both cases