Gradient Descent for General Reinforcement Learning

To appear in M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, MIT Press, Cambridge, MA, 1999.

Leemon Baird (leemon@cs.cmu.edu, www.cs.cmu.edu/~leemon)
Andrew Moore (awm@cs.cmu.edu, www.cs.cmu.edu/~awm)
Computer Science Department, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213-3891

Abstract

A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms, it also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. And these algorithms converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed.

1 CONVERGENCE OF GREEDY EXPLORATION

Many reinforcement-learning algorithms are known that use a parameterized function approximator to represent a value function, and adjust the weights incrementally during learning. Examples include Q-learning, SARSA, and advantage learning. There are simple MDPs where the original form of these algorithms fails to converge, as summarized in Table 1. For the cases with ✓, the algorithms are guaranteed to converge under reasonable assumptions such as

decaying learning rates. For the cases with X, there are known counterexamples where it will either diverge or oscillate between the best and worst possible policies, which have very different values. This can happen even with infinite training time and slowly decreasing learning rates (Baird, 95; Gordon, 96). Each X in the first two columns can be changed to a ✓ and made to converge by using a modified form of the algorithm, the residual form (Baird, 95). But this is only possible when learning with a fixed training distribution, and that is rarely practical.

Table 1. Current convergence results for incremental, value-based RL algorithms. Residual algorithms changed every X in the first two columns to ✓. The new algorithms proposed in this paper change every X to a ✓.

                              Fixed          Fixed          Usually-greedy
                              distribution   distribution   distribution
                              (on-policy)
Markov chain   Lookup table   ✓              ✓
               Averager       ✓              ✓
               Linear         ✓              X
               Nonlinear      X              X
MDP            Lookup table   ✓              ✓              ✓
               Averager       ✓              ✓              X
               Linear         X              X              X
               Nonlinear      X              X              X
POMDP          Lookup table   ✓              ✓              X
               Averager       ✓              ✓              X
               Linear         X              X              X
               Nonlinear      X              X              X

✓ = convergence guaranteed
X = a counterexample is known that either diverges or oscillates between the best and worst possible policies

For most large problems, it is useful to explore with a policy that is usually greedy with respect to the current value function, and that changes as the value function changes. In that case (the rightmost column of the chart), the current convergence guarantees are not very good. One way to guarantee convergence in all three columns is to modify the algorithm so that it is performing stochastic gradient descent on some average error function, where the average is weighted by state-visitation frequencies for the current usually-greedy policy. Then the weighting changes as the policy changes.

It might appear that this gradient is difficult to compute. Consider Q-learning exploring with a Boltzmann distribution that is usually greedy with respect to the learned Q function. It seems difficult to calculate gradients, since changing a single weight will change many Q values, changing a single Q value will change many action-choice probabilities in that state, and changing a single action-choice probability may affect the frequency with which every state in the MDP is visited. Although this might seem difficult, it is not. Surprisingly, unbiased estimates of the gradients of visitation distributions with respect to the weights can be calculated quickly, and the resulting algorithms can put a ✓ in every case in Table 1.

2 DERIVATION OF THE VAPS EQUATION

Consider a sequence of transitions observed while following a particular stochastic policy on an MDP.
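Let $s_t = \{x_0, u_0, R_0, x_1, u_1, R_1, \ldots, x_{t-1}, u_{t-1}, R_{t-1}, x_t, u_t, R_t\}$ be the sequence of states, actions, and reinforcements up to time $t$, where performing action $u_t$ in state $x_t$ yields reinforcement $R_t$ and a transition to state $x_{t+1}$. The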

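stochastic policy may be a function of a vector of weights $\mathbf{w}$. Assume the MDP has a single start state named $x_0$. If the MDP has terminal states, and $x_i$ is a terminal state, then $x_{i+1} = x_0$. Let $S_t$ be the set of all possible sequences from time 0 to $t$. Let $e(s_t)$ be a given error function that calculates an error on each time step, such as the squared Bellman residual at time $t$, or some other error occurring at time $t$. If $e$ is a function of the weights, then it must be a smooth function of the weights. Consider a period of time starting at time 0 and ending with probability $P(\mathrm{end} \mid s_t)$ after the sequence $s_t$ occurs. The probabilities must be such that the expected squared period length is finite. Let $B$ be the expected total error during that period, where the expectation is weighted according to the state-visitation frequencies generated by the given policy:

\[ B = \sum_{T=0}^{\infty} \sum_{s_T \in S_T} P(\text{period ends at time } T \text{ after trajectory } s_T) \sum_{i=0}^{T} e(s_i) \tag{1} \]

\[ B = \sum_{t=0}^{\infty} \sum_{s_t \in S_t} P(s_t)\, e(s_t) \tag{2} \]

where:

\[ P(s_t) = P(u_t \mid s_t)\, P(R_t \mid s_t) \prod_{i=0}^{t-1} P(u_i \mid s_i)\, P(R_i \mid s_i)\, P(s_{i+1} \mid s_i)\, \bigl[1 - P(\mathrm{end} \mid s_i)\bigr] \tag{3} \]

Note that on the first line, for a particular $s_t$, the error $e(s_t)$ will be added in to $B$ once for every sequence that starts with $s_t$. Each of these terms will be weighted by the probability of a complete trajectory that starts with $s_t$. The sum of the probabilities of all trajectories that start with $s_t$ is simply the probability of $s_t$ being observed, since the period is assumed to end eventually with probability one. So the second line equals the first. The third line is the probability of the sequence, of which only the policy factors of the form $P(u_t \mid x_t)$ might be functions of $\mathbf{w}$. If so, this probability must be a smooth function of the weights and nonzero everywhere. The partial derivative of $B$ with respect to $w$, a particular element of the weight vector $\mathbf{w}$, is:

\[ \frac{\partial B}{\partial w} = \sum_{t=0}^{\infty} \sum_{s_t \in S_t} \left[ P(s_t) \frac{\partial}{\partial w} e(s_t) + e(s_t)\, P(s_t) \sum_{j=1}^{t} \frac{\frac{\partial}{\partial w} P(u_{j-1} \mid s_{j-1})}{P(u_{j-1} \mid s_{j-1})} \right] \tag{4} \]

\[ \frac{\partial B}{\partial w} = \sum_{t=0}^{\infty} \sum_{s_t \in S_t} P(s_t) \left[ \frac{\partial}{\partial w} e(s_t) + e(s_t) \sum_{j=1}^{t} \frac{\partial}{\partial w} \ln P(u_{j-1} \mid s_{j-1}) \right] \tag{5} \]

Space here is limited, and it may not be clear from the short sketch of this derivation, but summing the bracketed term of (5) over an entire period does give an unbiased estimate of $\partial B / \partial w$, the gradient of the expected total error during a period. An incremental algorithm to perform stochastic gradient descent on $B$ is the weight update given on the left side of Table 2, where the summation over previous time steps is replaced with a trace $T_t$ for each weight. This algorithm is more general than previously published algorithms of this form, in that $e$ can be a function of all previous states, actions, and reinforcements, rather than just the current reinforcement. This is what allows VAPS to do both value and policy search.

Every algorithm proposed in this paper is a special case of the VAPS equation on the left side of Table 2. Note that no model is needed for this algorithm.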
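
The unbiasedness claim for the trace-based update can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' experiment: it assumes a single-state problem with two actions, a softmax policy whose preferences are the weights themselves, a fixed period of two actions, and a hypothetical error $e(s_t) = \frac{1}{2}(w_{u_{t-1}} - g_{u_{t-1}})^2$ with made-up targets `TARGET`. All function and constant names are invented for the example. It enumerates every trajectory, averages the per-period sum of $\partial e/\partial w + e\,T_t$, and compares the result with a finite-difference gradient of $B$.

```python
import itertools
import numpy as np

TARGET = np.array([1.0, -0.5])   # hypothetical per-action targets g_u
HORIZON = 2                      # assumed fixed period length (two actions)

def softmax(w):
    z = np.exp(w - w.max())
    return z / z.sum()

def grad_log_policy(w, u):
    # d/dw ln P(u) for a softmax policy: one_hot(u) - P
    g = -softmax(w)
    g[u] += 1.0
    return g

def error(w, u_prev):
    # toy error e(s_t); it depends on the weights and on the previous action
    return 0.5 * (w[u_prev] - TARGET[u_prev]) ** 2

def grad_error(w, u_prev):
    g = np.zeros_like(w)
    g[u_prev] = w[u_prev] - TARGET[u_prev]
    return g

def vaps_direction(w, actions):
    """Sum over one period of [de/dw + e * T_t], i.e. the sum of -dw_t / alpha."""
    trace = np.zeros_like(w)
    total = np.zeros_like(w)
    for u_prev in actions:                      # at time t, u_prev is u_{t-1}
        trace += grad_log_policy(w, u_prev)     # trace update, Delta T_t
        total += grad_error(w, u_prev) + error(w, u_prev) * trace
    return total

def expected_total_error(w):
    # B for this toy problem: the same expected error on each of HORIZON steps
    p = softmax(w)
    return HORIZON * sum(p[u] * error(w, u) for u in range(len(w)))

def expected_vaps_direction(w):
    # exact expectation over all action sequences under the current policy
    p = softmax(w)
    out = np.zeros_like(w)
    for actions in itertools.product(range(len(w)), repeat=HORIZON):
        prob = np.prod([p[u] for u in actions])
        out += prob * vaps_direction(w, actions)
    return out

def finite_difference(f, w, h=1e-6):
    g = np.zeros_like(w)
    for k in range(len(w)):
        d = np.zeros_like(w)
        d[k] = h
        g[k] = (f(w + d) - f(w - d)) / (2 * h)
    return g

w0 = np.array([0.3, -0.2])
vaps_grad = expected_vaps_direction(w0)
fd_grad = finite_difference(expected_total_error, w0)
```

Under these assumptions `vaps_grad` and `fd_grad` agree up to finite-difference error. The cross terms, in which the error at one step multiplies the log-probability gradient of an earlier action, average to zero here because the toy error does not depend on earlier actions; in a real MDP those terms carry the effect of the policy on state visitation.
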
The only probability needed in the algorithm is the policy, not the transition probability from the MDP. This is stochastic gradient descent on $B$, and the update rule is only correct if the observed transitions are sampled from trajectories found by following

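the current, stochastic policy. Both $e$ and $P$ should be smooth functions of $\mathbf{w}$, and for any given $\mathbf{w}$ vector, $e$ should be bounded. The algorithm is simple, but actually generates a large class of different algorithms depending on the choice of $e$ and when the trace is reset to zero.

Table 2. The general VAPS algorithm (left), and several instantiations of it (right). This single algorithm includes both value-based and policy-search approaches and their combination, and gives guaranteed convergence in every case.

Left (general update):

\[ \Delta w_t = -\alpha \left[ \frac{\partial}{\partial w} e(s_t) + e(s_t)\, T_t \right], \qquad \Delta T_t = \frac{\partial}{\partial w} \ln P(u_{t-1} \mid s_{t-1}) \]

Right (instantiations):

\[ e_{SARSA}(s_t) = \tfrac{1}{2} E^2\bigl[ R_{t-1} + \gamma Q(x_t, u_t) - Q(x_{t-1}, u_{t-1}) \bigr] \]

\[ e_{Q\text{-}learning}(s_t) = \tfrac{1}{2} E^2\bigl[ R_{t-1} + \gamma \max_u Q(x_t, u) - Q(x_{t-1}, u_{t-1}) \bigr] \]

\[ e_{advantage}(s_t) = \tfrac{1}{2} E^2\Bigl[ \tfrac{1}{K}\bigl( R_{t-1} + \gamma \max_u A(x_t, u) \bigr) + \bigl(1 - \tfrac{1}{K}\bigr) \max_u A(x_{t-1}, u) - A(x_{t-1}, u_{t-1}) \Bigr] \]

\[ e_{value\text{-}iteration}(s_t) = \tfrac{1}{2} \Bigl( \max_{u_{t-1}} E\bigl[ R_{t-1} + \gamma V(x_t) \bigr] - V(x_{t-1}) \Bigr)^2 \]

\[ e_{SARSA\text{-}policy}(s_t) = (1 - \beta)\, e_{SARSA}(s_t) + \beta \bigl( b - \gamma^t R_t \bigr) \]

For a single sequence, sampled by following the current policy, the sum of $\Delta w$ along the sequence will give an unbiased estimate of the true gradient, with finite variance. Therefore, during learning, if weight updates are made at the end of each trial, and if the weights stay within a bounded region, and the learning rate approaches zero, then $B$ will converge with probability one. Adding a weight-decay term (a constant times the 2-norm of the weight vector) onto $B$ will prevent weight divergence for small initial learning rates. There is no guarantee that a global minimum will be found when using general function approximators, but at least it will converge. This is true for backprop as well.

3 INSTANTIATING THE VAPS ALGORITHM

Many reinforcement-learning algorithms are value-based; they try to learn a value function that satisfies the Bellman equation. Examples are Q-learning, which learns a value function, actor-critic algorithms, which learn a value function and the policy which is greedy with respect to it, and TD(1), which learns a value function based on future rewards. Other algorithms are pure policy-search algorithms; they directly learn a policy that returns high rewards. These include REINFORCE (Williams, 1988), backprop through time, learning automata, and genetic algorithms. The algorithms proposed here combine the two approaches: they perform Value And Policy Search (VAPS). The general VAPS equation is instantiated by choosing an expression for $e$. This can be a Bellman residual (yielding value-based), the reinforcement (yielding policy-search), or a linear combination of the two (yielding Value And Policy Search). The single VAPS update rule on the left side of Table 2 generates a variety of different types of algorithms, some of which are described in the following sections.

3.1 REDUCING MEAN SQUARED RESIDUAL PER TRIAL

If the MDP has terminal states, and a trial is the time from the start until a terminal state is reached, then it is possible to minimize the expected total error per trial by resetting the trace to zero at the start of each trial. Then, a convergent form of SARSA, Q-learning, incremental value iteration, or advantage learning can be generated by choosing $e$ to be the squared Bellman residual, as shown on the right side of Table 2. In each case, the expected value is taken over all possible $(x_t, u_t, R_t)$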

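triplets, given $s_{t-1}$. The policy must be a smooth, nonzero function of the weights. So it could not be an ε-greedy policy that chooses the greedy action with probability (1-ε) and chooses uniformly otherwise. That would cause a discontinuity in the gradient when two Q values in a state were equal. But the policy could be something that approaches ε-greedy as a positive temperature $c$ approaches zero:

\[ P(u_t \mid x_t) = \frac{\varepsilon}{n} + (1 - \varepsilon)\, \frac{1 + e^{Q(x_t, u_t)/c}}{\sum_{u'} \bigl( 1 + e^{Q(x_t, u')/c} \bigr)} \tag{6} \]

where $n$ is the number of possible actions in each state. For each instance in Table 2 other than value iteration, the gradient of $e$ can be estimated using two independent, unbiased estimates of the expected value. For example:

\[ \frac{\partial}{\partial w} e_{SARSA}(s_t) \approx \bigl[ R_{t-1} + \gamma Q(x_t, u_t) - Q(x_{t-1}, u_{t-1}) \bigr] \left[ \gamma \varphi \frac{\partial}{\partial w} Q(x'_t, u'_t) - \frac{\partial}{\partial w} Q(x_{t-1}, u_{t-1}) \right] \tag{7} \]

When φ=1, this is an estimate of the true gradient. When φ<1, this is a residual algorithm, as described in (Baird, 95), and it retains guaranteed convergence, but may learn more quickly than pure gradient descent for some values of φ. Note that the gradient of $Q(x, u)$ at time $t$ uses primed variables. That means a new state and action at time $t$ were generated independently from the state and action at time $t-1$. Of course, if the MDP is deterministic, then the primed variables are the same as the unprimed. If the MDP is nondeterministic but the model is known, then the model must be evaluated one additional time to get the other state.

If the model is not known, then there are three choices. First, a model could be learned from past data, and then evaluated to give this independent sample. Second, the issue could be ignored, simply reusing the unprimed variables in place of the primed variables. This may affect the quality of the learned function (depending on how random the MDP is), but does not stop convergence, and may be an acceptable approximation in practice. Third, all past transitions could be recorded, and the primed variables could be found by searching for all the times $(x_{t-1}, u_{t-1})$ has been seen before, randomly choosing one of those transitions, and using its successor state and action as the primed variables. This is equivalent to learning the certainty equivalence model and sampling from it, and so is a special case of the first choice. For extremely large state-action spaces with many starting states, this is likely to give the same result in practice as simply reusing the unprimed variables as the primed variables. Note that when the weights do not affect the policy at all, these algorithms reduce to standard residual algorithms (Baird, 95).

It is also possible to reduce the mean squared residual per step, rather than per trial. This is done by making period lengths independent of the policy, so minimizing error per period will also minimize the error per step.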
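
The role of the two independent samples in the SARSA gradient estimate can be illustrated with exact expectations on a toy transition. The sketch below is only an illustration under stated assumptions: a tabular Q with one weight per state-action pair, a previous pair fixed at state 0 and action 0, and a hypothetical table `OUTCOMES` of two equally likely successors. None of these names or numbers come from the paper. It compares the gradient of $\frac{1}{2}E^2[\cdot]$ with the expectation of the two-sample product and with the expectation obtained when one sample is reused for both factors.

```python
import numpy as np

GAMMA = 0.9
PREV = (0, 0)   # assumed previous pair (x_{t-1}, u_{t-1})

# Hypothetical successor table for PREV:
# (probability, reward R_{t-1}, next state x_t, next action u_t)
OUTCOMES = [
    (0.5, 1.0, 1, 0),
    (0.5, 0.0, 2, 1),
]

def q(w, x, u):
    return w[x, u]

def grad_q(w, x, u):
    # gradient of a tabular Q value: 1 at the (x, u) entry, 0 elsewhere
    g = np.zeros_like(w)
    g[x, u] = 1.0
    return g

def residual(w, outcome):
    _, r, x, u = outcome
    return r + GAMMA * q(w, x, u) - q(w, *PREV)

def grad_residual(w, outcome, phi=1.0):
    _, _, x, u = outcome
    return GAMMA * phi * grad_q(w, x, u) - grad_q(w, *PREV)

def exact_gradient(w):
    """Gradient of 0.5 * (E[residual])**2, using the known outcome table."""
    mean_res = sum(o[0] * residual(w, o) for o in OUTCOMES)
    mean_grad = sum(o[0] * grad_residual(w, o) for o in OUTCOMES)
    return mean_res * mean_grad

def expected_two_sample(w):
    """E[residual(sample a) * grad_residual(sample b)] with a, b independent."""
    total = np.zeros_like(w)
    for a in OUTCOMES:
        for b in OUTCOMES:
            total += a[0] * b[0] * residual(w, a) * grad_residual(w, b)
    return total

def expected_one_sample(w):
    """Expectation when the same successor is reused for both factors."""
    return sum(o[0] * residual(w, o) * grad_residual(w, o) for o in OUTCOMES)

w = np.array([[0.2, 0.0], [1.0, 0.0], [0.0, -1.0]])
two_sample = expected_two_sample(w)
one_sample = expected_one_sample(w)
target = exact_gradient(w)
```

With these numbers `two_sample` matches `target`, while `one_sample` does not: reusing the sample yields the gradient of the mean of the squared residual rather than of the squared mean. The two coincide when the transition is deterministic, which is consistent with the remark above that the primed and unprimed variables are then the same.
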
A period might, for example, be defined to be the first 100 steps, after which the traces are reset, and the state is returned to the start state. Note that if every state-action pair has a positive chance of being seen in the first 100 steps, then this will not just be solving a finite-horizon problem. It will actually be solving the discounted, infinite-horizon problem, by reducing the Bellman residual in every state. But the weighting of the residuals will be determined only by what happens during the first 100 steps. Many different problems can be solved by the VAPS algorithm by instantiating the definition of "period" in different ways.

3.2 POLICY-SEARCH AND VALUE-BASED LEARNING

It is also possible to add a term that tries to maximize reinforcement directly. For example, $e$ could be defined to be $e_{SARSA\text{-}policy}$ rather than $e_{SARSA}$ from Table 2, and

the trace reset to zero after each terminal state is reached. The constant $b$ does not affect the expected gradient, but does affect the noise distribution, as discussed in (Williams, 88). When β=0, the algorithm will try to learn a Q function that satisfies the Bellman equation, just as before. When β=1, it directly learns a policy that will minimize the expected total discounted reinforcement. The resulting Q function may not even be close to containing true Q values or to satisfying the Bellman equation; it will just give a good policy. When β is in between, this algorithm tries both to satisfy the Bellman equation and to give good greedy policies. A similar modification can be made to any of the algorithms in Table 2.

In the special case where β=1, this algorithm reduces to the REINFORCE algorithm (Williams, 1988). REINFORCE has been rederived for the special case of Gaussian action distributions (Tresp & Hofman, 1995), and extensions of it appear in (Marbach, 1998). This case of pure policy search is particularly interesting, because for β=1 there is no need for any kind of model or for generating two independent successors. Other algorithms have been proposed for finding policies directly, such as those given in (Gullapalli, 92) and the various algorithms from learning automata theory summarized in (Narendra & Thathachar, 89). The VAPS algorithm proposed here appears to be the first one unifying these two approaches to reinforcement learning, finding a value function that both approximates a Bellman-equation solution and directly optimizes the greedy policy.

[Figure 1: diagram of a POMDP with states start, A, B, and end, and a plot of Trials (logarithmic axis) against Beta (0 to 1).]

Figure 1. A POMDP and the number of trials needed to learn vs. β. A combination of policy-search and value-based RL outperforms either alone.

Figure 1 shows simulation results for the combined algorithm. A run is said to have learned when the greedy policy is optimal for 1000 consecutive trials. The graph shows the average plot of 100 runs, with different initial random weights between $\pm 10^{-6}$. The learning rate was optimized separately for each β value. R=1 when leaving state A, R=2 when leaving state B or entering end, and R=0 otherwise. γ=0.9. The algorithm used was the modified Q-learning from Table 2, with exploration as in equation 6, and φ=c=1, b=0, ε=0.1. States A and B share the same parameters, so ordinary SARSA or greedy Q-learning could never converge, as shown in (Gordon, 96). When β=0 (pure value-based), the new algorithm converges, but of course cannot learn the optimal policy in the start state, since those two Q values learn to be equal.
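
The exploration rule of equation 6 used in this simulation can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `smooth_epsilon_greedy` and the example Q values are assumptions, and the shift by `m` is only a numerical safeguard against overflow that leaves the probabilities unchanged.

```python
import numpy as np

def smooth_epsilon_greedy(q_values, epsilon, c):
    """Action probabilities epsilon/n + (1 - epsilon) * (1 + exp(Q/c)) / sum(1 + exp(Q'/c))."""
    q = np.asarray(q_values, dtype=float) / c
    m = max(q.max(), 0.0)
    # exp(-m) + exp(q - m) is proportional to 1 + exp(q), with no overflow
    terms = np.exp(-m) + np.exp(q - m)
    return epsilon / len(q) + (1.0 - epsilon) * terms / terms.sum()

# Settings quoted for the simulation: c = 1, epsilon = 0.1 (the Q values are made up)
sim_probs = smooth_epsilon_greedy([1.0, 2.0], epsilon=0.1, c=1.0)

# A small temperature approaches ordinary epsilon-greedy
cold_probs = smooth_epsilon_greedy([1.0, 2.0, 0.5], epsilon=0.1, c=0.01)

# Tied Q values give equal probabilities, with no discontinuity at the tie
tie_probs = smooth_epsilon_greedy([1.0, 1.0], epsilon=0.1, c=0.01)
```

Every action keeps a probability of at least ε/n, so the policy is nonzero everywhere, and the dependence on the Q values is smooth for any positive c, which are the two properties the VAPS update requires of the policy.
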
When β=1 (pure policy-search), learning converges to optimality, but slowly, since there is no value function caching the results in the long sequence of states near the end. By combining the two approaches, the new algorithm learns much more quickly than either alone.

It is interesting that the VAPS algorithms described in the last three sections can be applied directly to a Partially Observable Markov Decision Process (POMDP), where the true state is hidden, and all that is available on each time step is an

ambiguous observation, which is a function of the true state. Normally, an algorithm such as SARSA only has guaranteed convergence when applied to an MDP. The VAPS algorithms will converge in such cases.

4 CONCLUSION

A new algorithm has been presented. Special cases of it give new algorithms similar to Q-learning, SARSA, and advantage learning, but with guaranteed convergence for a wider range of problems than was previously possible, including POMDPs. For the first time, these can be guaranteed to converge, even when the exploration policy changes during learning. Other special cases allow new approaches to reinforcement learning, where there is a tradeoff between satisfying the Bellman equation and improving the greedy policy. For one MDP, simulation showed that this combined algorithm learned more quickly than either approach alone. This unified theory, unifying for the first time both value-based and policy-search reinforcement learning, is of theoretical interest, and also was of practical value for the simulations performed. Future research with this unified framework may be able to empirically or analytically address the old question of when it is better to learn value functions and when it is better to learn the policy directly. It may also shed light on the new question of when it is best to do both at once.

Acknowledgments

This research was sponsored in part by the U.S. Air Force.

References

Baird, L. C. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Armand Prieditis & Stuart Russell, eds., Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufman Publishers, San Francisco, CA.

Gordon, G. (1996). Stable fitted reinforcement learning. In G. Tesauro, M. Mozer, and M. Hasselmo (eds.), Advances in Neural Information Processing Systems 8, pp. 1052-1058. MIT Press, Cambridge, MA.

Gullapalli, V. (1992). Reinforcement Learning and Its Application to Control. Dissertation and COINS Technical Report 92-10, University of Massachusetts, Amherst, MA.

Kaelbling, L. P., Littman, M. L. & Cassandra, A. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, to appear. Available now at http://www.cs.brown.edu/people/lpk.

Marbach, P. (1998). Simulation-Based Optimization of Markov Decision Processes. Thesis LIDS-TH 2429, Massachusetts Institute of Technology.

McCallum, A. (1995). Reinforcement learning with selective perception and hidden state. Dissertation, Department of Computer Science, University of Rochester, Rochester, NY.

Narendra, K., & Thathachar, M.A.L. (1989). Learning automata: An introduction. Prentice Hall, Englewood Cliffs, NJ.

Tresp, V., & R. Hofman (1995). "Missing and noisy data in nonlinear time-series prediction". In Proceedings of Neural Networks for Signal Processing 5, F. Girosi, J. Makhoul, E. Manolakos and E. Wilson, eds., IEEE Signal Processing Society, New York, New York, 1995, pp. 1-10.

Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical report NU-CCS-88-3, Northeastern University, Boston, MA.