Making Complex Decisions: Markov Decision Processes


Vasant Honavar
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
honavar@cs.iastate.edu | www.cs.iastate.edu/~honavar/ | www.cild.iastate.edu/ | www.bcb.iastate.edu/ | www.igert.iastate.edu

Making Complex Decisions: Markov Decision Problems
- How to use knowledge about the world to make decisions when there is uncertainty about the consequences of actions.
- Rewards are delayed.

The Solution
- Sequential decision problems in uncertain environments can be solved by calculating a policy that associates an optimal decision with every environmental state: a Markov Decision Process (MDP).

Example: The 4 x 3 Grid World
- Actions have uncertain consequences: the intended move succeeds with probability 0.8, and with probability 0.1 each the agent moves at right angles to the intended direction.
[Figure: a 4 x 3 grid world with a start state and terminal states labeled +1 and -1.]


Cumulative Discounted Reward
- Suppose rewards are bounded by M. The cumulative discounted reward is then bounded by
  $M + \gamma M + \gamma^2 M + \dots = \frac{M}{1 - \gamma}$
- Note: for the geometric series to converge, $0 \le \gamma < 1$.

Utility of a State Sequence
- Additive rewards: $U_h([s_0, s_1, s_2, \dots]) = R(s_0) + R(s_1) + R(s_2) + \dots$
- Discounted rewards: $U_h([s_0, s_1, s_2, \dots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots$

Utility of a State
- The utility of each state is the expected sum of discounted rewards if the agent executes the policy $\pi$:
  $U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, s_0 = s\right]$
- The true utility of a state corresponds to the optimal policy $\pi^*$.

Calculating the Optimal Policy
- Value iteration
- Policy iteration

Value Iteration
- Calculate the utility of each state.
- Then use the state utilities to select an optimal action in each state:
  $\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U(s')$

Value Iteration Algorithm
  function VALUE-ITERATION(mdp) returns a utility function
    local variables: U, U', initially identical to R
    repeat
      U <- U'
      for each state s do
        U'(s) <- R(s) + gamma * max_a sum_{s'} T(s, a, s') U(s')    (Bellman update)
      end
    until CLOSE-ENOUGH(U, U')
    return U
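Below is a minimal Python sketch of the value iteration algorithm above. The representation is an assumption for illustration: `actions(s)` returns the actions available in state s (empty for terminal states), `T[(s, a)]` is a list of `(probability, next_state)` pairs, and `R[s]` is the reward of state s.

```python
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeated Bellman updates: U'(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')."""
    U = {s: 0.0 for s in states}
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            best = max((sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions(s)),
                       default=0.0)           # terminal states: no actions, so just R(s)
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                       # crude CLOSE-ENOUGH test
            return U
```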

Value Iteration Algorithm: Example
[Figure: the utilities of the states of the 4 x 3 world obtained after value iteration.]

Policy Iteration
- Pick a policy, then calculate the utility of each state given that policy (the value determination step).
- Update the policy at each state using the utilities of the successor states.
- Repeat until the policy stabilizes.

Policy Iteration Algorithm
  function POLICY-ITERATION(mdp) returns a policy
    local variables: U, a utility function; pi, a policy
    repeat
      U <- VALUE-DETERMINATION(pi, U, mdp, R)
      unchanged? <- true
      for each state s do
        if max_a sum_{s'} T(s, a, s') U(s') > sum_{s'} T(s, pi(s), s') U(s') then
          pi(s) <- argmax_a sum_{s'} T(s, a, s') U(s')
          unchanged? <- false
      end
    until unchanged?
    return pi
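A minimal Python sketch of policy iteration, using the same assumed MDP representation as the value iteration sketch above. Here value determination is approximated by repeated fixed-policy backups (`eval_sweeps` is an illustrative choice); the slides instead solve it exactly with linear algebra, as sketched after the next slide.

```python
def expected_utility(s, a, U, T):
    """sum_{s'} T(s, a, s') U(s')."""
    return sum(p * U[s2] for p, s2 in T[(s, a)])

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=30):
    """Alternate value determination and greedy policy improvement."""
    pi = {s: (actions(s)[0] if actions(s) else None) for s in states}
    U = {s: 0.0 for s in states}
    while True:
        # value determination: evaluate the fixed policy pi by repeated backups
        for _ in range(eval_sweeps):
            U = {s: R[s] + (gamma * expected_utility(s, pi[s], U, T)
                            if pi[s] is not None else 0.0)
                 for s in states}
        unchanged = True
        for s in states:
            if not actions(s):
                continue
            best = max(actions(s), key=lambda a: expected_utility(s, a, U, T))
            if expected_utility(s, best, U, T) > expected_utility(s, pi[s], U, T):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```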

Value Determination
- A simplification of the value iteration algorithm, because the policy is fixed.
- Linear equations, because the max() operator has been removed.
- Solve exactly for the utilities using standard linear algebra (a sketch follows after the POMDP slide below).

Optimal Policy (policy iteration with linear equations)
[Figure: the 4 x 3 world with its optimal policy; the terminal states are labeled +1 and -1.]
For example:
  $u(1,1) = 0.8\, u(1,2) + 0.1\, u(2,1) + 0.1\, u(1,1)$
  $u(1,2) = 0.8\, u(1,3) + 0.2\, u(1,2)$

Partially Observable MDPs (POMDPs)
- In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probability.
- A POMDP is specified by:
  a state transition function $P(s_{t+1} \mid s_t, a_t)$
  an observation function $P(o_t \mid s_t, a_t)$
  a reward function $E(r_t \mid s_t, a_t)$
- Approach: calculate a probability distribution over the possible states given all previous percepts, and base decisions on this distribution.
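Returning to the value determination step: because the policy is fixed, the utilities satisfy the linear system $U = R + \gamma T_{\pi} U$, which can be solved exactly. A minimal NumPy sketch under the same assumed representation as above (terminal states map to `None` in the policy):

```python
import numpy as np

def value_determination(pi, states, T, R, gamma=0.9):
    """Solve (I - gamma * T_pi) u = R exactly for the utilities of fixed policy pi."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    r = np.array([R[s] for s in states], dtype=float)
    for s in states:
        if pi[s] is None:                      # terminal state: u(s) = R(s)
            continue
        for p, s2 in T[(s, pi[s])]:
            P[idx[s], idx[s2]] += p
    u = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: u[idx[s]] for s in states}
```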

Learning from Interaction with the World
- An agent receives sensations or percepts from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment.
- The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it stumbles along in an a priori unknown, uncertain environment.

Supervised Learning
- Experience = labeled examples: inputs -> supervised learning system -> outputs.
- Objective: minimize the error between desired and actual outputs.

Reinforcement Learning
- Experience = action-induced state transitions and rewards: inputs -> reinforcement learning system -> outputs = actions.
- Objective: maximize reward.

Reinforcement Learning
- The learner is not told which actions to take.
- Rewards and punishments may be delayed: sacrifice short-term gains for greater long-term gains.
- The need to trade off between exploration and exploitation.
- The environment may not be observable, or may be only partially observable.
- The environment may be deterministic or stochastic.

[Figure: the reinforcement learning loop; the environment sends states and rewards to the agent, and the agent sends actions to the environment.]

Key Elements of an RL System
- Policy: what to do.
- Reward: what is good.
- Value: what is good because it predicts reward.
- Model of the environment: what follows what.

An Extended Example: Tic-Tac-Toe
[Figure: a game tree of tic-tac-toe positions, with alternating levels labeled "X moves" and "O moves".]
- Assume an imperfect opponent: he/she sometimes makes mistakes.

A Simple RL Approach to Tic-Tac-Toe
- Make a table with one entry per state: V(s) = estimated probability of winning, with 1 for a win, 0 for a loss, 0.5 for a draw, and 0.5 initially for all other states.
- Now play lots of games. To pick our moves, look ahead one step from the current state to the possible next states.
- Pick the next state with the highest estimated probability of winning, i.e., the largest V(s): a greedy move. Occasionally pick a move at random: an exploratory move.

RL Learning Rule for Tic-Tac-Toe
[Figure: a sequence of positions from one game, alternating the opponent's moves, our greedy moves, and an occasional exploratory move; s is the state before our greedy move and s' is the state after our greedy move.]
- We increment each V(s) toward V(s'), a backup (a code sketch follows after the applications slide below):
  $V(s) \leftarrow V(s) + \alpha\,[V(s') - V(s)]$

Why Is Tic-Tac-Toe Too Easy?
- The number of states is small and finite.
- One-step look-ahead is always possible.
- The state is completely observable.

Some Notable RL Applications
- TD-Gammon: the world's best backgammon program (Tesauro).
- Elevator control (Crites & Barto).
- Inventory management: 10-15% improvement over industry-standard methods (Van Roy, Bertsekas, Lee and Tsitsiklis).
- Dynamic channel assignment: high-performance assignment of radio channels to mobile telephone calls (Singh and Bertsekas).
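A minimal sketch of the tic-tac-toe backup rule above. Board states are assumed to be hashable (e.g., strings), unseen states default to the 0.5 initial value, and the surrounding game loop (move generation, opponent play) is assumed to be supplied elsewhere.

```python
import random

def td_backup(V, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * [V(s') - V(s)]."""
    V.setdefault(s, 0.5)
    V.setdefault(s_next, 0.5)
    V[s] += alpha * (V[s_next] - V[s])

def choose_move(V, candidate_states, epsilon=0.1):
    """Mostly greedy in V(s), with occasional exploratory moves."""
    for s in candidate_states:
        V.setdefault(s, 0.5)
    if random.random() < epsilon:
        return random.choice(candidate_states)        # exploratory move
    return max(candidate_states, key=lambda s: V[s])  # greedy move
```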

The n-Armed Bandit Problem
- Choose repeatedly from one of n actions; each choice is called a play.
- After each play $a_t$, you get a reward $r_t$, where $E[r_t \mid a_t] = Q^*(a_t)$. The distribution of $r_t$ depends only on $a_t$.
- The objective is to maximize the reward in the long term, e.g., over 1000 plays.

The Exploration-Exploitation Dilemma
- Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$. The greedy action at time t is $a_t^* = \arg\max_a Q_t(a)$.
- $a_t = a_t^*$: exploitation. $a_t \neq a_t^*$: exploration.
- You can't exploit all the time; you can't explore all the time. You can never stop exploring, but you could reduce exploring.

Action-Value Methods (Stateless)
- Adapt action-value estimates and nothing else.
- Suppose that by the t-th play, action a had been chosen $k_a$ times, producing rewards $r_1, r_2, \dots, r_{k_a}$; then
  $Q_t(a) = \frac{r_1 + r_2 + \dots + r_{k_a}}{k_a}$, and $\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$.

Greedy and ε-Greedy Action Selection
- Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$.
- ε-greedy: $a_t = a_t^*$ with probability $1 - \varepsilon$, and a random action with probability $\varepsilon$.
- Boltzmann (softmax): $\Pr(\text{choosing action } a \text{ at time } t) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$, where $\tau$ is the computational temperature.

Incremental Implementation
- Recall the sample-average estimation method: the average of the first k rewards is $Q_k = \frac{r_1 + r_2 + \dots + r_k}{k}$.
- An incremental update rule does not require storing past rewards:
  $Q_{k+1} = Q_k + \frac{1}{k+1}\,[r_{k+1} - Q_k]$

Tracking a Nonstationary Environment
- Choosing $Q_k$ to be a sample average is appropriate in a stationary environment, in which the dependence of rewards on actions is time-invariant, i.e., none of the $Q^*(a)$ change over time.
- In a nonstationary environment, it is better to use an exponential, recency-weighted average:
  $Q_{k+1} = Q_k + \alpha\,[r_{k+1} - Q_k]$ for constant $\alpha$, $0 < \alpha \le 1$,
  which equals $(1 - \alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1 - \alpha)^{k-i} r_i$.
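A minimal bandit loop combining ε-greedy selection with the incremental sample-average update above. The Gaussian reward model, the arm values, and the parameter settings are illustrative assumptions.

```python
import random

def run_bandit(true_values, plays=1000, epsilon=0.1):
    """epsilon-greedy selection with Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1)."""
    n = len(true_values)
    Q = [0.0] * n                  # action-value estimates
    counts = [0] * n               # times each action has been chosen
    total = 0.0
    for _ in range(plays):
        if random.random() < epsilon:
            a = random.randrange(n)                    # explore
        else:
            a = max(range(n), key=lambda i: Q[i])      # exploit (greedy)
        r = random.gauss(true_values[a], 1.0)          # assumed reward distribution
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]                 # incremental update
        total += r
    return Q, total

# Example: a 5-armed bandit with assumed true action values
Q_est, total_reward = run_bandit([0.2, -0.5, 1.0, 0.4, 0.0])
```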

Reinforcement Learning When the Agent Can Sense and Respond to Environmental States
[Figure: the agent-environment interface; at each step the agent receives state $s_t$ and reward $r_t$ and emits action $a_t$; the environment returns $r_{t+1}$ and $s_{t+1}$.]
- Agent and environment interact at discrete time steps $t = 0, 1, 2, \dots$
- The agent observes the state at step t: $s_t \in S$;
- produces an action at step t: $a_t \in A(s_t)$;
- gets the resulting reward $r_{t+1} \in \Re$ and the resulting next state $s_{t+1}$.

The Agent Learns a Policy
- Policy at step t, $\pi_t$: $\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$; a mapping from states to action probabilities.
- Reinforcement learning methods specify how the agent changes its policy as a result of experience.
- Roughly, the agent's goal is to get as much reward as it can over the long run.

Agent-Environment Interface: Goals and Rewards
- Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
- A goal should specify what we want to achieve, not how we want to achieve it.
- A goal is typically outside the agent's direct control.
- The agent must be able to measure success: explicitly, and frequently during its lifespan.

Rewards
- Suppose the sequence of rewards after step t is $r_{t+1}, r_{t+2}, r_{t+3}, \dots$. What do we want to maximize?
- In general, we want to maximize the expected return $E\{R_t\}$ for each step t.
- Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
  $R_t = r_{t+1} + r_{t+2} + \dots + r_T$, where T is a final time step at which a terminal state is reached, ending an episode.

Rewards for Continuing Tasks
- Continuing tasks: interaction does not have natural episodes.
- Discounted return (a worked example follows after the pole-balancing slide below):
  $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
  where $\gamma$, $0 \le \gamma \le 1$, is the discount rate. $\gamma$ near 0: shortsighted; $\gamma$ near 1: farsighted.

Example: Pole-Balancing Task
- Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
- As an episodic task where the episode ends upon failure: reward = +1 for each step before failure, so return = the number of steps before failure.
- As a continuing task with discounted return: reward = -1 upon failure, 0 otherwise, so return = $-\gamma^k$ for k steps before failure.
- In either case, the return is maximized by avoiding failure for as long as possible.
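A tiny helper that makes the discounted-return formula concrete; the reward list and discount rate are arbitrary illustrative values.

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma**k * r_{t+k+1} for a finite list of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# e.g., a continuing pole-balancing run that fails after 3 steps: rewards 0, 0, 0, -1
print(discounted_return([0, 0, 0, -1]))   # -0.729, i.e., -gamma**3 for gamma = 0.9
```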

Example -- Driving Task
- Get to the top of the hill as quickly as possible.
- reward = -1 for each step when not at the top of the hill, so return = -(number of steps) before reaching the top of the hill.
- The return is maximized by minimizing the number of steps taken to reach the top of the hill.

The Markov Property
- By the "state" at step t we mean whatever information is available to the agent at step t about its environment. The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
- Ideally, a state should summarize past sensations so as to retain all essential information; it should have the Markov Property:
  $\Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t, a_t\}$
  for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0$.

Markov Decision Processes
- If a reinforcement learning task has the Markov Property, it is called a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP.
- To define a finite MDP, you need to specify:
  the state and action sets;
  the one-step dynamics defined by the transition probabilities $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$;
  the reward probabilities $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$.

Recycling Robot: A Finite MDP Example
- At each step, the robot has to decide whether it should (a) actively search for a can, (b) wait for someone to bring it a can, or (c) go to home base and recharge.
- Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
- Decisions are made on the basis of the current energy level: high or low.
- Reward = number of cans collected.

Value Functions
- The value of a state is the expected return starting from that state; it depends on the agent's policy.
  State-value function for policy $\pi$:
  $V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right\}$
- The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
  Action-value function for policy $\pi$:
  $Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right\}$

Bellman Equation for Policy $\pi$
- The basic idea:
  $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \dots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \dots) = r_{t+1} + \gamma R_{t+1}$
- So: $V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$
- Or, without the expectation operator:
  $V^{\pi}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma V^{\pi}(s')]$
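One sweep of the Bellman backup above, written for the $P^a_{ss'}$, $R^a_{ss'}$ formulation of this slide (rewards on transitions, stochastic policy), which differs from the $R(s)$ representation used in the earlier dynamic-programming sketches. `pi[s]` is assumed to map actions to probabilities, and `P[(s, a)]` to a list of `(probability, reward, next_state)` triples; both names are illustrative.

```python
def bellman_backup(V, states, pi, P, gamma=0.9):
    """V(s) <- sum_a pi(s,a) sum_s' P^a_ss' [R^a_ss' + gamma V(s')], for every s."""
    return {
        s: sum(prob_a * sum(p * (r + gamma * V[s2]) for p, r, s2 in P[(s, a)])
               for a, prob_a in pi[s].items())
        for s in states
    }
```

Iterating this sweep to a fixed point is iterative policy evaluation; it computes $V^{\pi}$.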

Optimal Value Functions
- For finite MDPs, policies can be partially ordered: $\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$.
- There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all by $\pi^*$.
- Optimal policies share the same optimal state-value function: $V^*(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in S$.
- Optimal policies also share the same optimal action-value function: $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $s \in S$ and $a \in A(s)$. This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for V*
- The value of a state under an optimal policy must equal the expected return for the best action from that state:
  $V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma V^*(s')]$
- [Backup diagram: from state s, maximize over actions a, then average over rewards r and next states s'.]
- $V^*$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*
  $Q^*(s, a) = E\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'}\,[R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')]$
- [Backup diagram: from the pair (s, a), average over rewards r and next states s', then maximize over next actions a'.]
- $Q^*$ is the unique solution of this system of nonlinear equations.

Why Optimal State-Value Functions Are Useful
- Any policy that is greedy with respect to $V^*$ is an optimal policy.
- Therefore, given $V^*$, a one-step-ahead search produces the long-term optimal actions.

What About Optimal Action-Value Functions?
- Given $Q^*$, the agent does not even have to do a one-step-ahead search:
  $\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$

Solving the Bellman Optimality Equation
- Finding an optimal policy by solving the Bellman optimality equation requires:
  accurate knowledge of the environment dynamics;
  enough space and time to do the computation;
  the Markov Property.
- How much space and time do we need? Polynomial in the number of states (via dynamic programming methods), BUT the number of states is often huge.
- We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.

Efficiency of DP
- Finding an optimal policy is polynomial in the number of states, BUT the number of states often grows exponentially with the number of state variables.
- In practice, classical DP can be applied to problems with a few million states. Asynchronous DP can be applied to larger problems and is appropriate for parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Reinforcement Learning
[Figure: the reinforcement learning loop; the environment sends states and rewards to the agent, and the agent sends actions to the environment.]

Markov Decision Processes
- Assume a finite set of states S and a set of actions A.
- At each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$, then receives immediate reward $r_t$, and the state changes to $s_{t+1}$.
- Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$, i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action.
- The functions $\delta$ and $r$ may be nondeterministic, and they are not necessarily known to the agent.

The Agent's Learning Task
- Execute actions in the environment, observe the results, and learn an action policy $\pi : S \to A$ that maximizes $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$ from any starting state in S. Here $0 \le \gamma < 1$ is the discount factor for future rewards.
- Note something new: the target function is $\pi : S \to A$, but we have no training examples of the form $\langle s, a \rangle$; training examples are of the form $\langle \langle s, a \rangle, r \rangle$.

Reinforcement Learning Problem
- Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$, where $0 \le \gamma < 1$.

Learning an Action-Value Function
- Estimate $Q^{\pi}$ for the current behavior policy $\pi$.
- After every transition from a nonterminal state $s_t$, do:
  $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$
- If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.
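The update above uses the quintuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$; it is the rule commonly known as Sarsa. A minimal tabular sketch; the ε-greedy behavior policy and the `env.reset()` / `env.step(a)` interface (returning `(next_state, reward, done)`) are illustrative assumptions, not something defined in the slides.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One episode of Q(s,a) <- Q(s,a) + alpha * [r + gamma*Q(s',a') - Q(s,a)]."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = None if done else epsilon_greedy(Q, s2, actions, epsilon)
        target = r + (0.0 if done else gamma * Q[(s2, a2)])  # terminal: Q(s',a') = 0
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2

Q = defaultdict(float)    # tabular Q, default 0 for unseen (s, a) pairs
```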

Value Function
- To begin, consider deterministic worlds. For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states:
  $V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$,
  where $r_t, r_{t+1}, \dots$ are generated by following policy $\pi$ starting at state s.
- Restated, the task is to learn the optimal policy $\pi^*$:
  $\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \ (\forall s)$

What to Learn
- We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$).
- It could then do a look-ahead search to choose the best action from any state s, because
  $\pi^*(s) = \arg\max_a [r(s, a) + \gamma V^*(\delta(s, a))]$
- A problem: this works well if the agent knows $\delta : S \times A \to S$ and $r : S \times A \to \Re$. But when it doesn't, it can't choose actions this way.

Action-Value Function: The Q Function
- Define a new function very similar to $V^*$:
  $Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))$
- If the agent learns Q, it can choose the optimal action even without knowing $\delta$:
  $\pi^*(s) = \arg\max_a [r(s, a) + \gamma V^*(\delta(s, a))] = \arg\max_a Q(s, a)$
- Q is the evaluation function the agent will learn.

Training Rule to Learn Q
- Note that Q and $V^*$ are closely related: $V^*(s) = \max_{a'} Q(s, a')$.
- This allows us to write Q recursively as
  $Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
- Let $\hat{Q}$ denote the learner's current approximation to Q. Consider the training rule
  $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$,
  where $s'$ is the state resulting from applying action a in state s.

Q-Learning
  $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$
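A minimal tabular sketch of the Q-learning update just above; the ε-greedy action choice and the `env.reset()` / `env.step(a)` interface are the same illustrative assumptions as in the Sarsa sketch.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                 # explore
            else:
                a = max(actions, key=lambda x: Q[(s, x)])  # exploit
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

Setting alpha to 1 recovers the deterministic-world assignment rule shown on the next slide.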

Q-Learning for Deterministic Worlds
- For each s, a, initialize the table entry $\hat{Q}(s, a) \leftarrow 0$. Observe the current state s.
- Do forever:
  Select an action a and execute it.
  Receive the immediate reward r and observe the new state $s'$.
  Update the table entry for $\hat{Q}(s, a)$ as follows: $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$.
  $s \leftarrow s'$.

Updating $\hat{Q}$
- Example: $\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$
- Notice that if the rewards are non-negative, then $\hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$ for all $s, a, n$, and $0 \le \hat{Q}_n(s, a) \le Q(s, a)$ for all $s, a, n$.

Convergence Theorem
- Theorem: $\hat{Q}$ converges to Q. Consider the case of a deterministic world with bounded immediate rewards, where each $\langle s, a \rangle$ is visited infinitely often.
- Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval, the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
- Let $\hat{Q}_n$ be the table after n updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is, $\Delta_n = \max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|$.

Convergence Theorem (continued)
- For any table entry $\hat{Q}_n(s, a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
  $|\hat{Q}_{n+1}(s, a) - Q(s, a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))|$
  $= \gamma\,|\max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a')|$
  $\le \gamma \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|$
  $\le \gamma \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')| = \gamma \Delta_n$
- Note: we used the general fact that $|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$.

The Nondeterministic Case
- What if the reward and the next state are nondeterministic? We redefine V and Q by taking expected values:
  $V^{\pi}(s) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots] \equiv E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$
  $Q(s, a) \equiv E[r(s, a) + \gamma V^*(\delta(s, a))]$

Q-Learning in the Nondeterministic Case
- Q-learning generalizes to nondeterministic worlds. Alter the training rule to
  $\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s, a) + \alpha_n\,[r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')]$,
  where $\alpha_n = \frac{1}{1 + \mathrm{visits}_n(s, a)}$.
- Convergence of $\hat{Q}$ to Q can be proved (Watkins and Dayan, 1992).

Temporal Difference Learning
- Temporal Difference (TD) learning methods:
  can be used when accurate models of the environment are unavailable (neither the state transition function nor the reward function is known);
  can be extended to work with implicit representations of action-value functions;
  are among the most useful reinforcement learning methods.

Example: TD-Gammon
- Learns to play backgammon (Tesauro, 1995).
- Immediate reward: +100 if win, -100 if lose, 0 for all other states.
- Trained by playing 1.5 million games against itself; now comparable to the best human player.

Temporal Difference Learning: TD(λ)
- Q-learning: reduce the discrepancy between successive Q estimates.
- One-step time difference:
  $Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$
- Why not two steps?
  $Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$
- Or n?
  $Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$
- Blend all of these:
  $Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\left[Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \dots\right]$
- Equivalent expression:
  $Q^{\lambda}(s_t, a_t) = r_t + \gamma\left[(1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1})\right]$
- The TD(λ) algorithm uses the above training rule. It sometimes converges faster than Q-learning, and it converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992).
- Tesauro's TD-Gammon uses this algorithm.

Handling Large State Spaces
- Replace the $\hat{Q}$ table with a neural net or other function approximator.
- Virtually any function approximator will work, provided it can be updated in an online fashion.

Learning State-Action Values with Function Approximation
- Training examples are of the form: $\langle \text{description of } (s_t, a_t),\ v_t \rangle$.
- The general gradient-descent rule:
  $\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,[v_t - Q_t(s_t, a_t)]\,\nabla_{\vec{\theta}} Q_t(s_t, a_t)$

Linear Gradient-Descent Watkins' Q(λ)
[Algorithm slide: combines the gradient-descent rule above with eligibility traces.]
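To make the gradient-descent rule concrete for the linear case, here is a minimal sketch in which $Q(s, a) = \vec{\theta} \cdot \vec{\phi}(s, a)$, so the gradient is simply the feature vector $\vec{\phi}(s, a)$. The feature vector and the training target v (e.g., a one-step Q-learning target) are illustrative assumptions; the eligibility traces used in Watkins' Q(λ) are omitted for brevity.

```python
import numpy as np

def linear_q(theta, phi):
    """Linear action-value estimate: Q(s, a) = theta . phi(s, a)."""
    return float(np.dot(theta, phi))

def gradient_update(theta, phi, v, alpha=0.01):
    """theta <- theta + alpha * [v - Q(s, a)] * grad_theta Q(s, a);
    for a linear approximator, grad_theta Q(s, a) = phi(s, a)."""
    return theta + alpha * (v - linear_q(theta, phi)) * phi

# Illustrative usage with made-up features and target
theta = np.zeros(8)
phi_sa = np.random.rand(8)     # assumed feature vector for some (s, a)
v_target = 1.0                 # e.g., r + gamma * max_a' Q(s', a')
theta = gradient_update(theta, phi_sa, v_target)
```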