Reinforcement Learning

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Reinforcement Learning
Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599
Copyright 2002, Andrew W. Moore. April 23rd, 2002

Predicting Delayed Rewards
[Diagram: a six-state discounted Markov system S1 ... S6, with a reward R on each state and transition probabilities on the arcs, e.g. Prob(next state = S5 | this state = S4) = 0.8, etc.]
What is the expected sum of future rewards (discounted)?

    J*(S_i) = E[ Σ_{t=0}^{∞} γ^t R(S[t])  |  S[0] = S_i ]

Just Solve It! We use standard Markov System Theory.

Learning Delayed Rewards
[Diagram: the same six states S1 ... S6, but now every reward and transition probability is unknown (R = ?).]
All you can see is a series of states and rewards:
    S1(R=0)  S2(R=0)  S3(R=4)  S2(R=0)  S4(R=0)  S5(R=0)
Task: based on this sequence, estimate J*(S1), J*(S2), ..., J*(S6).

Idea 1: Supervised Learning
Assume γ = 1/2.
    S1(R=0)  S2(R=0)  S3(R=4)  S2(R=0)  S4(R=0)  S5(R=0)
At t=1 we were in state S1 and eventually got a long-term discounted reward of 0 + γ·0 + γ²·4 + γ³·0 + γ⁴·0 = 1.
At t=2, in state S2, ltdr = 2.    At t=5, in state S4, ltdr = 0.
At t=3, in state S3, ltdr = 4.    At t=6, in state S5, ltdr = 0.
At t=4, in state S2, ltdr = 0.

    State | Observations of LTDR | Mean LTDR
    S1    | 1                    | 1 = Ĵ(S1)
    S2    | 2, 0                 | 1 = Ĵ(S2)
    S3    | 4                    | 4 = Ĵ(S3)
    S4    | 0                    | 0 = Ĵ(S4)
    S5    | 0                    | 0 = Ĵ(S5)

Supervised Learning ALG
Watch a trajectory S[0] r[0] S[1] r[1] ... S[T] r[T].
For t = 0, 1, ..., T, compute the long-term discounted reward from time t:

    J[t] = Σ_{j=0}^{T−t} γ^j r[t+j]

Compute Ĵ(S_i) = the mean value of J[t] among all transitions beginning in state S_i on the trajectory. That is, let MATCHES(S_i) = { t : S[t] = S_i }, and define

    Ĵ(S_i) = ( Σ_{t ∈ MATCHES(S_i)} J[t] ) / |MATCHES(S_i)|

You're done!

Supervised Learning ALG for the timid
If you have an anxious personality you may be worried about edge effects for some of the final transitions. With large trajectories these are negligible.
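A minimal sketch of this estimator in Python (the function name, the trajectory format, and the dictionary of estimates are my own choices, not from the slides):

```python
import numpy as np

def supervised_estimate(states, rewards, gamma):
    """Estimate J for every state by averaging long-term discounted rewards (LTDRs)."""
    T = len(states)
    # LTDR[t] = discounted sum of rewards from time t to the end, computed backwards
    ltdr = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        ltdr[t] = running
    # average the LTDRs observed from each state
    return {s: float(np.mean(ltdr[[t for t in range(T) if states[t] == s]]))
            for s in set(states)}

# The six-step trajectory from the slides, with gamma = 1/2:
traj_states  = ["S1", "S2", "S3", "S2", "S4", "S5"]
traj_rewards = [0, 0, 4, 0, 0, 0]
print(supervised_estimate(traj_states, traj_rewards, gamma=0.5))
# {'S1': 1.0, 'S2': 1.0, 'S3': 4.0, 'S4': 0.0, 'S5': 0.0}
```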

Online Supervised Learning
Initialize:  Count[S_i] = 0 for all S_i;  Sum[S_i] = 0 for all S_i;  Eligibility[S_i] = 0 for all S_i.
Observe: when we experience S_i with reward r, do this:
    Elig[S_j] ← γ Elig[S_j]            (for every state S_j)
    Elig[S_i] ← Elig[S_i] + 1
    Sum[S_j] ← Sum[S_j] + r × Elig[S_j]  (for every state S_j)
    Count[S_i] ← Count[S_i] + 1
Then at any time, Ĵ(S_i) = Sum[S_i] / Count[S_i].

Online Supervised Learning Economics
Given N states S1 ... SN, OSL needs O(N) memory. Each update needs O(N) work, since we must update all the Elig[ ] array elements.
Idea: be sparse and only update/process the Elig[ ] elements with values > ξ for tiny ξ. There are only log ξ / log γ such elements.
Easy to prove: as T → ∞, Ĵ(S_i) → J*(S_i) for all S_i.
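A sketch of the online version above; the class and attribute names are mine, and states are plain dictionary keys:

```python
from collections import defaultdict

class OnlineSupervised:
    """Online supervised learning of J via eligibilities, as on the slide."""
    def __init__(self, gamma):
        self.gamma = gamma
        self.count = defaultdict(int)      # Count[S_i]
        self.total = defaultdict(float)    # Sum[S_i]
        self.elig  = defaultdict(float)    # Elig[S_i]

    def observe(self, state, reward):
        # decay all eligibilities, bump the current state's eligibility,
        # then credit the reward to every state in proportion to its eligibility
        for s in self.elig:
            self.elig[s] *= self.gamma
        self.elig[state] += 1.0
        for s in self.elig:
            self.total[s] += reward * self.elig[s]
        self.count[state] += 1

    def estimate(self, state):
        return self.total[state] / self.count[state] if self.count[state] else 0.0
```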

Online Supervised Learning
Let's grab OSL off the street, bundle it into a black van, take it to a bunker and interrogate it under 600 Watt lights.
    S1(r=0)  S2(r=0)  S3(r=4)  S2(r=0)  S4(r=0)  S5(r=0)

    State | Observations of LTDR | Ĵ(S_i)
    S1    | 1                    | 1
    S2    | 2, 0                 | 1
    S3    | 4                    | 4
    S4    | 0                    | 0
    S5    | 0                    | 0

COMPLAINT: there is something a little suspicious about this (efficiency-wise).

Certainty-Equivalent (CE) Learning
    S1(r=0)  S2(r=0)  S3(r=4)  S2(r=0)  S4(r=0)  S5(r=0)
Idea: use your data to estimate the underlying Markov system, instead of trying to estimate Ĵ directly.
Estimated Markov System: [diagram: states S1 (r̂=0), S2 (r̂=0), S3 (r̂=4), S4 (r̂=0), S5 (r̂=0); you draw in the transitions + probabilities.]
What are the estimated Ĵ values?

C.E. Method for Markov Systems
Initialize:
    Count[S_i] = 0 for all S_i            (# times visited S_i)
    SumR[S_i] = 0 for all S_i             (sum of rewards received from S_i)
    Trans[S_i, S_j] = 0 for all S_i, S_j  (# times transitioned from S_i to S_j)
When we are in state S_i, and we receive reward r, and we move to S_j:
    Count[S_i] ← Count[S_i] + 1
    SumR[S_i] ← SumR[S_i] + r
    Trans[S_i, S_j] ← Trans[S_i, S_j] + 1
Then at any time:
    r̂(S_i) = SumR[S_i] / Count[S_i]
    P̂_ij = estimated Prob(next = S_j | this = S_i) = Trans[S_i, S_j] / Count[S_i]

C.E. for Markov Systems (continued)
So at any time we have r̂(S_i) and P̂(next = S_j | this = S_i) = P̂_ij, and at any time we can solve the set of linear equations

    Ĵ(S_i) = r̂(S_i) + γ Σ_j P̂(S_j | S_i) Ĵ(S_j)

[In vector notation: Ĵ = r̂ + γ P̂ Ĵ, so Ĵ = (I − γ P̂)⁻¹ r̂, where Ĵ and r̂ are vectors of length N, P̂ is an N×N matrix, and N = # states.]
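A sketch of full certainty-equivalent learning, combining the counters with the linear solve Ĵ = (I − γP̂)⁻¹r̂; the integer state indexing and the zero defaults for unvisited states are my assumptions:

```python
import numpy as np

class CEMarkovSystem:
    """Certainty-equivalent estimation of a Markov system with n_states states."""
    def __init__(self, n_states, gamma):
        self.n, self.gamma = n_states, gamma
        self.count = np.zeros(n_states)
        self.sum_r = np.zeros(n_states)
        self.trans = np.zeros((n_states, n_states))

    def observe(self, i, r, j):
        # transition S_i --(reward r)--> S_j
        self.count[i] += 1
        self.sum_r[i] += r
        self.trans[i, j] += 1

    def solve(self):
        visited = self.count > 0
        denom = np.maximum(self.count, 1)
        r_hat = np.where(visited, self.sum_r / denom, 0.0)
        p_hat = np.where(visited[:, None], self.trans / denom[:, None], 0.0)
        # J = (I - gamma * P)^(-1) r
        return np.linalg.solve(np.eye(self.n) - self.gamma * p_hat, r_hat)
```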

C.E. Online Economics
Memory: O(N²).
Time to update counters: O(1).
Time to re-evaluate Ĵ:
    O(N³) if we use matrix inversion;
    O(N² k_CRIT) if we use value iteration and need k_CRIT iterations to converge;
    O(N k_CRIT) if we use value iteration, need k_CRIT iterations to converge, and the M.S. is sparse (i.e. the mean # of successors is constant).

Certainty Equivalent Learning
COMPLAINT: memory use could be O(N²)! And time per update could be O(N k_CRIT), up to O(N³)! Too expensive for some people.
Prioritized sweeping will help (see later), but first let's review a very inexpensive approach.

Why this obsession with onlineness?
I really care about supplying up-to-date estimates all the time. Can you guess why? If not, all will be revealed in good time...

Less Time, More Data: Limited Backups
Do the previous C.E. algorithm. At each time step we observe S_i --(r)--> S_j and update Count[S_i], SumR[S_i], Trans[S_i, S_j], and thus also update the estimates r̂_i and P̂_ij for the outcomes of S_i.
But instead of re-solving for Ĵ, do much less work: just do one backup of Ĵ[S_i]:

    Ĵ[S_i] ← r̂_i + γ Σ_j P̂_ij Ĵ[S_j]
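A sketch of the single backup, reusing the counter arrays from the certainty-equivalent sketch earlier (it assumes state i has been visited at least once):

```python
import numpy as np

def one_backup(J, i, gamma, sum_r, count, trans):
    """Back up a single state: J[i] <- r_hat_i + gamma * sum_j P_hat_ij * J[j]."""
    r_hat_i = sum_r[i] / count[i]
    p_hat_i = trans[i] / count[i]          # estimated outgoing transition probabilities
    J[i] = r_hat_i + gamma * p_hat_i @ J
    return J
```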

One Backup C.E. Economics
Space: O(N²). NO IMPROVEMENT THERE!
Time to update statistics: O(1). Time to update Ĵ: O(1).
Good news: much cheaper per transition.
Good news: a (modified) contraction mapping proof promises convergence to optimal.
Bad news: wastes data.

Prioritized Sweeping [Moore + Atkeson, 93]
Tries to be almost as data-efficient as full CE, but not much more expensive than One Backup CE.
On every transition, some number (β) of states may have a backup applied. Which ones? The most deserving.
We keep a priority queue of which states have the biggest potential for changing their Ĵ(S) value.
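The slide names the idea but not the bookkeeping, so the following is only a plausible sketch of prioritized sweeping rather than the exact Moore + Atkeson '93 algorithm; the priority rule, the theta threshold, and the preds predecessor map are my assumptions:

```python
import heapq
import numpy as np

def prioritized_sweep(J, preds, gamma, sum_r, count, trans, start, beta, theta=1e-4):
    """Do up to beta prioritized backups, seeded by the state just visited.

    preds[j] = set of states known to transition into j.
    """
    queue = [(-np.inf, start)]             # max-priority queue via negated keys
    for _ in range(beta):
        if not queue:
            break
        _, i = heapq.heappop(queue)
        old = J[i]
        J[i] = sum_r[i] / count[i] + gamma * (trans[i] / count[i]) @ J
        change = abs(J[i] - old)
        if change > theta:
            for p in preds[i]:             # predecessors may now deserve a backup
                heapq.heappush(queue, (-change, p))
    return J
```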

Where Are We?
Trying to do online prediction from streams of transitions.

    Method                   | Space    | Update Cost                                                    | Data Efficiency
    Supervised Learning      | O(N_s)   | O(log(1/ξ) / log(1/γ))                                         |
    Full C.E. Learning       | O(N_so)  | O(N_so N_s) (matrix solve) or O(N_so k_CRIT) (value iteration) |
    One Backup C.E. Learning | O(N_so)  | O(1)                                                           |
    Prioritized Sweeping     | O(N_so)  | O(1)                                                           |

N_so = # state-outcomes (number of arrows on the M.S. diagram); N_s = # states.
What next? Sample backups!!!

Temporal Difference Learning [Sutton 1988]
Only maintain a Ĵ array... nothing else!
So you've got Ĵ(S1), Ĵ(S2), ..., Ĵ(S_N) and you observe

    S_i --(r)--> S_j

a transition from i that receives an immediate reward of r and jumps to j. What should you do? Can you guess?

TD Learning
We observe S_i --(r)--> S_j and update Ĵ(S_i): we nudge it to be closer to the expected future rewards, using a weighted sum of the old estimate and the new one-step estimate of future rewards:

    Ĵ(S_i) ← (1 − α) Ĵ(S_i) + α [ r + γ Ĵ(S_j) ]

α is called a learning rate parameter. (See the learning rate in the neural nets lecture.)

Simplified TD Analysis
[Diagram: start state S0 (r=0); with probability P(1) go to terminal state 1 and receive reward r(1); with probability P(2) go to terminal state 2 and receive reward r(2); ...; with probability P(M) go to terminal state M and receive reward r(M). All probabilities and rewards are unknown.]
Suppose you always begin in S0. You then transition at random to one of M places. You don't know the transition probabilities. You then get a place-dependent reward (unknown in advance). Then the trial terminates.
Define J*(S0) = expected reward. Let's estimate it with TD.
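The TD update as code (J is a dict or array of value estimates; the names are mine):

```python
def td_update(J, i, r, j, alpha, gamma):
    """Nudge J[i] toward the one-step target r + gamma * J[j]."""
    J[i] = (1 - alpha) * J[i] + alpha * (r + gamma * J[j])
    return J
```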

[Diagram: S0 (r=0) with branches of probability p(1), p(2), ..., p(M) to terminal rewards r(1), r(2), ..., r(M).]
r(k) = reward of the k-th terminal state; p(k) = probability of the k-th terminal state.
We'll do a series of trials. The reward on the t-th trial is r_t, with

    E[r_t] = Σ_{k=1}^{M} p(k) r(k)

[Note: E[r_t] is independent of t.]
Define J*(S0) = J* = E[r_t].

Let's run TD-Learning, where Ĵ_t = our estimate of Ĵ(S0) before the t-th trial. From the definition of TD-Learning (the successor is terminal, so its value contributes nothing):

    Ĵ_{t+1} = (1 − α) Ĵ_t + α r_t

Useful quantity: define

    σ² = variance of the reward = E[(r_t − J*)²] = Σ_{k=1}^{M} p(k) (r(k) − J*)²

Remember J* = E[r_t], σ² = E[(r_t − J*)²], and Ĵ_{t+1} = α r_t + (1 − α) Ĵ_t. Then

    E[Ĵ_{t+1}] = E[ α r_t + (1 − α) Ĵ_t ] = α J* + (1 − α) E[Ĵ_t]

Thus...  lim_{t→∞} E[Ĵ_t] = J*.   WHY?
Is this impressive??

Remember J* = E[r_t], σ² = E[(r_t − J*)²], and Ĵ_{t+1} = α r_t + (1 − α) Ĵ_t.
Write S_t = the expected squared error between Ĵ_t and J* before the t-th iteration.

    S_{t+1} = E[(Ĵ_{t+1} − J*)²]
            = E[(α r_t + (1 − α) Ĵ_t − J*)²]
            = E[(α [r_t − J*] + (1 − α) [Ĵ_t − J*])²]
            = E[ α² (r_t − J*)² + 2α(1 − α)(r_t − J*)(Ĵ_t − J*) + (1 − α)² (Ĵ_t − J*)² ]
            = α² E[(r_t − J*)²] + 2α(1 − α) E[(r_t − J*)(Ĵ_t − J*)] + (1 − α)² E[(Ĵ_t − J*)²]
            = α² σ² + (1 − α)² S_t       WHY?

And it is thus easy to show that

    lim_{t→∞} S_t = lim_{t→∞} E[(Ĵ_t − J*)²] = α σ² / (2 − α)

What do you think of TD learning? How would you improve it?

Decaying Learning Rate
[Dayan, 1991ish] showed, for general TD learning of a Markov system (not just our simple model), that if you use the update rule

    Ĵ(S_i) ← α_t [ r + γ Ĵ(S_j) ] + (1 − α_t) Ĵ(S_i)

then, as the number of observations goes to infinity, Ĵ(S_i) → J*(S_i) for all S_i, PROVIDED
- all states are visited ∞ly often, and
- Σ_{t=1}^{∞} α_t = ∞    (this means: for all k there is a T such that Σ_{t=1}^{T} α_t > k)
- Σ_{t=1}^{∞} α_t² < ∞    (this means: there is a k such that for all T, Σ_{t=1}^{T} α_t² < k)
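A quick simulation of the simplified model, my own rather than from the slides, that checks the ασ²/(2−α) steady-state error; the three outcome probabilities and rewards are arbitrary assumed numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])            # assumed outcome probabilities
r = np.array([0.0, 1.0, 5.0])            # assumed terminal rewards
j_star = p @ r                           # J* = E[r_t]
sigma2 = p @ (r - j_star) ** 2           # variance of the reward
alpha = 0.1

runs, steps = 2000, 500
samples = rng.choice(r, size=(runs, steps), p=p)   # one reward per trial per run
J = np.zeros(runs)
for t in range(steps):
    J = (1 - alpha) * J + alpha * samples[:, t]

print(np.mean((J - j_star) ** 2))        # empirical steady-state squared error
print(alpha * sigma2 / (2 - alpha))      # theoretical value; the two should be close
```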

Decaying Learning Rate
This works:  α_t = 1/t.                        This doesn't:  α_t = α_0 (a constant).
This works:  α_t = β/(β + t)  [e.g. β = 1000].  This doesn't:  α_t = β α_{t−1}  (β < 1).

IN OUR EXAMPLE, USE α_t = 1/t.
Remember J* = E[r_t], σ² = E[(r_t − J*)²], and now

    Ĵ_{t+1} = (1/t) r_t + (1 − 1/t) Ĵ_t

Write C_t = t Ĵ_{t+1}. Then C_t = r_t + (t − 1) Ĵ_t = r_t + C_{t−1}, with C_0 = 0, and so you'll see that

    Ĵ_{t+1} = (1/t) Σ_{i=1}^{t} r_i

Decaying Learning Rate (continued)
And

    E[(Ĵ_{t+1} − J*)²] = σ²/t    (the initial error (Ĵ_0 − J*)² is washed out by the first update)

so, ultimately,

    lim_{t→∞} E[(Ĵ_t − J*)²] = 0
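A tiny check (mine) that the α_t = 1/t schedule turns the TD update into the running sample mean:

```python
rewards = [2.0, 0.0, 4.0, 1.0, 3.0]       # arbitrary illustrative rewards
J = 0.0
for t, r in enumerate(rewards, start=1):
    J = (1.0 / t) * r + (1.0 - 1.0 / t) * J   # TD update with alpha_t = 1/t
print(J, sum(rewards) / len(rewards))     # identical: 2.0 2.0
```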

A Fancier TD
Write S[t] = the state at time t. Suppose α = 1/4 and γ = 1/2.
Assume Ĵ(S_23) = 0, Ĵ(S_17) = 0, Ĵ(S_44) = 16.
Assume t = 405 and S[t] = S_23. Observe S_23 --(r=0)--> S_17 with reward 0.
Now t = 406, S[t] = S_17, S[t−1] = S_23, and Ĵ(S_23) = 0, Ĵ(S_17) = 0, Ĵ(S_44) = 16.
Observe S_17 --(r=0)--> S_44.
Now t = 407, S[t] = S_44, and Ĵ(S_23) = 0, Ĵ(S_17) = 2, Ĵ(S_44) = 16.
INSIGHT: Ĵ(S_23) might think "I gotta get me some of that!!!"

TD(λ) Comments
- TD(λ=0) is the original TD.
- TD(λ=1) is almost the same as supervised learning (except it uses a learning rate instead of explicit counts).
- TD(λ=0.7) is often empirically the best performer.
- Dayan's proof holds for all 0 ≤ λ ≤ 1.
- Updates can be made more computationally efficient with eligibility traces (similar to O.S.L.).
- Question: can you invent a problem that would make TD(0) look bad and TD(1) look good? How about TD(0) look good & TD(1) bad??
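The slides mention eligibility traces without spelling out the rule, so the following accumulating-trace TD(λ) step is a standard textbook form rather than anything quoted from the lecture; J and elig are dicts keyed by state:

```python
from collections import defaultdict

def td_lambda_step(J, elig, i, r, j, alpha, gamma, lam):
    """One TD(lambda) step with accumulating eligibility traces.

    After S_i --(r)--> S_j, every recently-visited state gets a share of the TD
    error, so in the example above S_23 would also 'get some of that'.
    """
    delta = r + gamma * J[j] - J[i]       # TD error for this transition
    elig[i] += 1.0                        # accumulate the trace for the current state
    for s in list(elig):
        J[s] += alpha * delta * elig[s]
        elig[s] *= gamma * lam            # traces decay by gamma * lambda
    return J, elig

# Example usage with dictionaries defaulting to zero:
J, elig = defaultdict(float), defaultdict(float)
J["S44"] = 16.0
td_lambda_step(J, elig, "S23", 0.0, "S17", alpha=0.25, gamma=0.5, lam=0.5)
td_lambda_step(J, elig, "S17", 0.0, "S44", alpha=0.25, gamma=0.5, lam=0.5)
```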

Learning M.S. Summary

    MODEL-BASED              | Space    | Update Cost                                                    | Data Efficiency
    Supervised Learning      | O(N_s)   | O(log(1/ξ) / log(1/γ))                                         |
    Full C.E. Learning       | O(N_so)  | O(N_so N_s) (matrix solve) or O(N_so k_CRIT) (value iteration) |
    One Backup C.E. Learning | O(N_so)  | O(1)                                                           |
    Prioritized Sweeping     | O(N_so)  | O(1)                                                           |
    MODEL-FREE               |          |                                                                |
    TD(0)                    | O(N_s)   | O(1)                                                           |
    TD(λ), 0 < λ ≤ 1         | O(N_s)   | O(log(1/ξ) / log(1/(γλ)))                                      |

Learning Policies for MDPs
See the previous lecture's slides for the definition of, and computation with, MDPs.
The heart of REINFORCEMENT learning.
[Diagram: the agent observes the current state and reward and chooses an action.]

The task:
World:  You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot:  I'll take action 2.
World:  You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot:  I'll take action 1.
World:  You're in state 34 (again). Your immediate reward is 3. You have 3 actions.
...
The Markov property means that once you've selected an action, the P.D.F. of your next state is the same as the last time you tried that action in this state.

The Credit Assignment Problem
    state:   43   39   22    2    2    3   54   26
    reward:   0    0    0    0    0    0    0  100
    action:   2    4    1    1    1    2    2
Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there??
This is the Credit Assignment problem. It makes Supervised Learning approaches (e.g. BOXES [Michie & Chambers]) very, very slow.
Using the MDP assumption helps avoid this problem.

MDP Policy Learning

    Method                   | Space     | Update Cost      | Data Efficiency
    Full C.E. Learning       | O(N_sao)  | O(N_sao k_CRIT)  |
    One Backup C.E. Learning | O(N_sao)  | O(N_ao)          |
    Prioritized Sweeping     | O(N_sao)  | O(β N_ao)        |

(We'll think about Model-Free in a moment.)
The C.E. methods are very similar to the M.S. case, except now we do value-iteration-for-MDP backups:

    Ĵ(S_i) ← max_a [ r̂_i^a + γ Σ_{S_j ∈ SUCCS(S_i)} P̂(S_j | S_i, a) Ĵ(S_j) ]

Choosing Actions
We're in state S_i. We can estimate r̂_i^a, P̂(next = S_j | this = S_i, action a), and Ĵ(next = S_j). So what action should we choose?

    IDEA 1:  a = argmax_a [ r̂_i^a + γ Σ_j P̂(S_j | S_i, a) Ĵ(S_j) ]
    IDEA 2:  a = random

Any problems with these ideas? Any other suggestions? Could we be optimal?
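A sketch of the value-iteration-for-MDP backup and of the two action-choice ideas; the array shapes (r_hat[s, a] and p_hat[s, a, s']) are my assumptions, and the ε-greedy mix of IDEA 1 and IDEA 2 is a common compromise the slide itself does not name:

```python
import numpy as np

def mdp_backup(J, s, r_hat, p_hat, gamma):
    """J[s] <- max_a [ r_hat[s, a] + gamma * sum_s' p_hat[s, a, s'] * J[s'] ]."""
    J[s] = np.max(r_hat[s] + gamma * p_hat[s] @ J)
    return J

def choose_action(J, s, r_hat, p_hat, gamma, epsilon, rng):
    """IDEA 1 (greedy) most of the time, IDEA 2 (random) with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(r_hat[s])))                    # explore
    return int(np.argmax(r_hat[s] + gamma * p_hat[s] @ J))         # exploit
```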

Model-Free R.L.
Why not use T.D.? Observe

    S_i --(action a, reward r)--> S_j

and update

    Ĵ(S_i) ← α ( r + γ Ĵ(S_j) ) + (1 − α) Ĵ(S_i)

What's wrong with this?

Q-Learning: Model-Free R.L.  [Watkins, 1988]
Define Q*(S_i, a) = the expected sum of discounted future rewards if I start in state S_i, I then take action a, and I am subsequently optimal.
Questions:
    Define Q*(S_i, a) in terms of J*.
    Define J*(S_i) in terms of Q*.

Q-Learning Update
Note that

    Q*(S_i, a) = r_i^a + γ Σ_{S_j ∈ SUCCS(S_i)} P(S_j | S_i, a) max_{a'} Q*(S_j, a')

In Q-learning we maintain a table of Q̂ values instead of Ĵ values. When you see

    S_i --(action a, reward r)--> S_j

do

    Q̂(S_i, a) ← α [ r + γ max_{a'} Q̂(S_j, a') ] + (1 − α) Q̂(S_i, a)

This is even cleverer than it looks: the Q̂ values are not biased by any particular exploration policy. It avoids the Credit Assignment problem.

Q-Learning: Choosing Actions
Same issues as for CE choosing actions:
- Don't always be greedy, i.e. don't always choose argmax_a Q̂(s, a).
- Don't always be random (otherwise it will take a long time to reach somewhere exciting).
- Boltzmann exploration [Watkins]:  Prob(choose action a) ∝ exp( Q̂(s, a) / K_t ).
- Optimism in the face of uncertainty [Sutton 90, Kaelbling 90]: initialize Q-values optimistically high to encourage exploration, or take into account how often each (s, a) pair has been tried.
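A sketch of the tabular Q-learning update together with Boltzmann exploration; the array shapes and the fixed temperature argument are my choices:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Q[s, a] <- alpha * (r + gamma * max_a' Q[s_next, a']) + (1 - alpha) * Q[s, a]."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = alpha * target + (1 - alpha) * Q[s, a]
    return Q

def boltzmann_action(Q, s, temperature, rng):
    """Prob(choose action a) proportional to exp(Q[s, a] / temperature)."""
    prefs = Q[s] / temperature
    prefs -= prefs.max()                  # shift for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```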

Q-Learning Comments
[Watkins] proved that Q-learning will eventually converge to an optimal policy.
Empirically it is cute. Empirically it is very slow.
Why not do Q(λ)? It would not make much sense [it would reintroduce the credit assignment problem]. Some people (e.g. Peng & Williams) have tried to work their way around this.

If we had time... Value function approximation
- Use a Neural Net to represent Ĵ [e.g. Tesauro]
- Use a Neural Net to represent Q̂ [e.g. Crites]
- Use a decision tree: with Q-learning [Chapman + Kaelbling 91], with C.E. learning [Moore 91]
- How to split up space? Significance of Q values [Chapman + Kaelbling], execution accuracy monitoring [Moore 91], game theory [Moore + Atkeson 95], new influence/variance criteria [Munos 99]

If we had time... R.L. Theory
- Counterexamples [Boyan + Moore], [Baird]
- Value Function Approximators with Averaging will converge to something [Gordon]
- Neural Nets can fail [Baird]
- Neural Nets with Residual Gradient updates will converge to something
- Linear approximators for TD learning will converge to something useful [Tsitsiklis + Van Roy]

What You Should Know
- Supervised learning for predicting delayed rewards
- Certainty-equivalent learning for predicting delayed rewards
- Model-free learning (TD) for predicting delayed rewards
- Reinforcement Learning with MDPs: what is the task? Why is it hard to choose actions?
- Q-learning (including being able to work through small simulated examples of RL)