Chapter 2: Evaluative Feedback

Evaluating actions vs. instructing by giving correct actions.
Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken.
Supervised learning is instructive; optimization is evaluative.
Associative vs. Nonassociative:
- Associative: inputs mapped to outputs; learn the best output for each input
- Nonassociative: learn (find) one best output
The n-armed bandit (at least how we treat it) is:
- Nonassociative
- Evaluative feedback
(R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction)

The n-Armed Bandit Problem
Choose repeatedly from one of n actions; each choice is called a play.
After each play a_t, you get a reward r_t, where E{r_t | a_t} = Q*(a_t).
These are unknown action values; the distribution of r_t depends only on a_t.
The objective is to maximize the reward in the long term, e.g., over 1000 plays.
To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them.
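
This setup can be made concrete with a tiny simulator. This is a minimal sketch, not code from the slides; the class name NArmedBandit and the Gaussian reward model are illustrative assumptions (the Gaussian choice matches the 10-armed testbed described below).

```python
import numpy as np

class NArmedBandit:
    """Stationary n-armed bandit: each action a has a fixed true value Q*(a),
    and each play of a returns a noisy reward centered on Q*(a)."""

    def __init__(self, n=10, rng=None):
        self.rng = rng or np.random.default_rng()
        self.q_star = self.rng.normal(0.0, 1.0, size=n)  # unknown action values Q*(a)

    def play(self, a):
        # the reward distribution depends only on the chosen action a
        return self.rng.normal(self.q_star[a], 1.0)

    def optimal_action(self):
        return int(np.argmax(self.q_star))
```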

The Exploration/Exploitation Dilemma
Suppose you form estimates Q_t(a) ≈ Q*(a) (the action-value estimates).
The greedy action at t is a*_t = argmax_a Q_t(a).
Choosing a_t = a*_t is exploitation; choosing a_t ≠ a*_t is exploration.
You can't exploit all the time; you can't explore all the time.
You can never stop exploring; but you should always reduce exploring.

Action-Value Methods
Methods that adapt action-value estimates and nothing else. For example, suppose that by the t-th play action a had been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}. Then
    Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a     (the "sample average")
and
    lim_{k_a → ∞} Q_t(a) = Q*(a).
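
A minimal sketch of the sample-average estimate (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sample_average_estimates(rewards_by_action):
    """rewards_by_action: one list of observed rewards per action.
    Returns Q_t(a), the mean reward for each action (0 if never tried)."""
    return np.array([np.mean(rs) if rs else 0.0 for rs in rewards_by_action])

# e.g. action 0 played three times, action 1 twice:
print(sample_average_estimates([[1.0, 0.0, 2.0], [0.5, 1.5]]))  # [1. 1.]
```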

ε-Greedy Action Selection
Greedy action selection:
    a_t = a*_t = argmax_a Q_t(a)
ε-greedy:
    a_t = a*_t with probability 1 − ε, a random action with probability ε
... the simplest way to try to balance exploration and exploitation.
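
A sketch of the selection rule itself, assuming a vector Q of current estimates (the function name is an illustration, not an API from the book):

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action,
    otherwise pick a greedy action (argmax of the current estimates)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```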

10-Armed Testbed
n = 10 possible actions.
Each Q*(a) is chosen randomly from a normal distribution N(0, 1).
Each reward r_t is also normal: N(Q*(a_t), 1).
1000 plays per run; repeat the whole thing 2000 times and average the results.
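
One possible sketch of the testbed experiment, using the incremental form of the sample average (introduced a few slides below) and smaller run counts than the slide's 2000 × 1000 so it finishes quickly; all names are illustrative:

```python
import numpy as np

def run_testbed(epsilon, n=10, plays=1000, runs=200, seed=0):
    """Average reward per play of epsilon-greedy with sample-average estimates,
    averaged over independently drawn n-armed bandit tasks."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(plays)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, n)        # true action values for this task
        Q, N = np.zeros(n), np.zeros(n)         # estimates and play counts
        for t in range(plays):
            a = int(rng.integers(n)) if rng.random() < epsilon else int(np.argmax(Q))
            r = rng.normal(q_star[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]           # incremental sample average
            avg_reward[t] += r
    return avg_reward / runs

# compare greedy and epsilon = 0.1 over the last 100 plays:
print(run_testbed(0.0)[-100:].mean(), run_testbed(0.1)[-100:].mean())
```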

ε-Greedy Methods on the 10-Armed Testbed
[Figure: average reward and % optimal action vs. plays (0 to 1000), comparing ε = 0 (greedy), ε = 0.01, and ε = 0.1.]

Softmax Action Selection
Softmax action selection methods grade action probabilities by estimated values.
The most common softmax uses the Gibbs, or Boltzmann, distribution: choose action a on play t with probability
    e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ},
where τ is the "computational temperature".
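
A sketch of Boltzmann selection (the maximum is subtracted before exponentiating purely for numerical stability; names are illustrative):

```python
import numpy as np

def softmax_action(Q, tau, rng=None):
    """Choose an action with probability proportional to exp(Q(a) / tau)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(Q, dtype=float) / tau
    probs = np.exp(prefs - prefs.max())   # shift for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```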

Binary Bandit Tasks
Suppose you have just two actions, a_t = 1 or a_t = 2, and just two rewards, r_t = success or r_t = failure.
Then you might infer a target or desired action:
    d_t = a_t if success, the other action if failure,
and then always play the action that was most often the target.
Call this the supervised algorithm. It works fine on deterministic tasks...
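
A sketch of this supervised rule (the names and the tie-breaking choice are assumptions): count how often each action has been the inferred target and always play the current leader.

```python
import numpy as np

def supervised_binary_bandit(success_prob, plays=500, rng=None):
    """success_prob: success probabilities for actions 0 and 1.
    Returns how often each action ended up as the inferred target."""
    rng = rng or np.random.default_rng()
    target_counts = np.zeros(2)
    for _ in range(plays):
        a = int(np.argmax(target_counts))   # most frequent target so far (0 on ties)
        success = rng.random() < success_prob[a]
        target = a if success else 1 - a    # infer the desired action
        target_counts[target] += 1
    return target_counts
```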

Contingency Space
The space of all possible binary bandit tasks:
[Figure: success probability for action 2 plotted against success probability for action 1. The quadrants where both probabilities lie on the same side of 0.5 are labeled DIFFICULT PROBLEMS, the other two quadrants EASY PROBLEMS; example bandits A and B are marked in the difficult regions.]

Linear Learning Automata
Let π_t(a) = Pr{a_t = a} be the only adapted parameter.

L_R-I (linear, reward-inaction):
    On success: π_{t+1}(a_t) = π_t(a_t) + α(1 − π_t(a_t)), 0 < α < 1
    (the other action probs. are adjusted to still sum to 1)
    On failure: no change

L_R-P (linear, reward-penalty):
    On success: π_{t+1}(a_t) = π_t(a_t) + α(1 − π_t(a_t)), 0 < α < 1
    (the other action probs. are adjusted to still sum to 1)
    On failure: π_{t+1}(a_t) = π_t(a_t) + α(0 − π_t(a_t)), 0 < α < 1

For two actions, a stochastic, incremental version of the supervised algorithm.
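
A sketch of a single update step (the function name and the proportional rescaling of the other actions are assumptions); passing alpha_penalty = 0 gives L_R-I, a positive value gives L_R-P:

```python
import numpy as np

def linear_automaton_update(pi, a, success, alpha_reward, alpha_penalty=0.0):
    """Update the action probabilities pi after playing action a."""
    pi = pi.copy()
    if success:
        pi[a] += alpha_reward * (1.0 - pi[a])    # move pi(a) toward 1
    else:
        pi[a] += alpha_penalty * (0.0 - pi[a])   # move pi(a) toward 0 (no-op if 0)
    # rescale the other actions so the probabilities still sum to 1
    others = np.arange(len(pi)) != a
    if pi[others].sum() > 0:
        pi[others] *= (1.0 - pi[a]) / pi[others].sum()
    return pi
```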

Performance on Binary Bandit Tasks A and B
[Figure: % optimal action vs. plays (0 to 500) on bandits A and B, comparing action-value methods, L_R-I, L_R-P, and the supervised algorithm.]

Incremental Implementation
Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on a):
    Q_k = (r_1 + r_2 + ... + r_k) / k
Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:
    Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} − Q_k]
This is a common form for update rules:
    NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
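
The equivalence of the stored-rewards average and the incremental update can be checked numerically; a minimal sketch with illustrative names:

```python
import numpy as np

rewards = np.random.default_rng(0).normal(1.0, 1.0, size=20)

Q_batch = rewards.mean()     # average of all rewards, stored explicitly

# NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), StepSize = 1/k
Q = 0.0
for k, r in enumerate(rewards, start=1):
    Q += (r - Q) / k         # step size 1/k reproduces the sample average exactly

print(np.isclose(Q, Q_batch))  # True
```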

Tracking a Nonstationary Problem
Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time.
But not in a nonstationary problem. Better in the nonstationary case is
    Q_{k+1} = Q_k + α [r_{k+1} − Q_k]   for constant α, 0 < α ≤ 1
            = (1 − α)^k Q_0 + Σ_{i=1}^{k} α(1 − α)^{k−i} r_i,
an exponential, recency-weighted average.
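
The closed-form weighting can be verified against the recursive update; a sketch with illustrative names:

```python
import numpy as np

def constant_alpha_estimate(rewards, alpha, Q0=0.0):
    """Exponential, recency-weighted average via the constant step-size update."""
    Q = Q0
    for r in rewards:
        Q += alpha * (r - Q)
    return Q

rewards = np.array([1.0, 2.0, 0.5, 3.0])
alpha, Q0, k = 0.1, 0.0, 4
# weight on r_i is alpha * (1 - alpha)^(k - i), plus (1 - alpha)^k on Q0
weights = alpha * (1 - alpha) ** (k - np.arange(1, k + 1))
closed_form = (1 - alpha) ** k * Q0 + (weights * rewards).sum()
print(np.isclose(constant_alpha_estimate(rewards, alpha, Q0), closed_form))  # True
```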

Optimistic Initial Values
All methods so far depend on Q_0(a), i.e., they are biased. Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a.
[Figure: % optimal action vs. plays, comparing optimistic greedy (Q_0 = 5, ε = 0) with realistic ε-greedy (Q_0 = 0, ε = 0.1).]
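
As a sketch (names illustrative), optimism is only a change of initialization; a purely greedy learner still explores early on because each disappointing reward drags the tried action's estimate below the untried ones:

```python
import numpy as np

def optimistic_greedy(q_star, plays=1000, Q0=5.0, alpha=0.1, rng=None):
    """Greedy selection with optimistic initial estimates and a constant step size.
    Returns the fraction of plays that chose the optimal action."""
    rng = rng or np.random.default_rng()
    Q = np.full(len(q_star), Q0)    # every action initially looks better than it can be
    best, hits = int(np.argmax(q_star)), 0
    for _ in range(plays):
        a = int(np.argmax(Q))       # no epsilon, purely greedy
        r = rng.normal(q_star[a], 1.0)
        Q[a] += alpha * (r - Q[a])
        hits += (a == best)
    return hits / plays
```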

Reinforcement Comparison
Compare rewards to a reference reward r̄_t, e.g., an average of observed rewards.
Strengthen or weaken the action taken depending on r_t − r̄_t.
Let p_t(a) denote the preference for action a. Preferences determine action probabilities, e.g., by the Gibbs distribution:
    π_t(a) = Pr{a_t = a} = e^{p_t(a)} / Σ_{b=1}^{n} e^{p_t(b)}
Then:
    p_{t+1}(a_t) = p_t(a_t) + β [r_t − r̄_t]   and   r̄_{t+1} = r̄_t + α [r_t − r̄_t]
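
A sketch of one reinforcement-comparison step (the function name, and the separate step size β for the preferences, follow the update above; both are assumptions here):

```python
import numpy as np

def reinforcement_comparison_step(p, r_bar, a, r, alpha, beta, rng=None):
    """Update preferences p and reference reward r_bar after playing a and observing r,
    then sample the next action from the Gibbs distribution over preferences."""
    rng = rng or np.random.default_rng()
    p = p.copy()
    p[a] += beta * (r - r_bar)        # strengthen or weaken the action just taken
    r_bar += alpha * (r - r_bar)      # track the reference (average) reward
    probs = np.exp(p - p.max())       # Gibbs distribution (shifted for stability)
    probs /= probs.sum()
    next_a = int(rng.choice(len(p), p=probs))
    return p, r_bar, next_a
```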

Performance of Reinforcement Comparison Method
[Figure: % optimal action vs. plays (0 to 1000), comparing reinforcement comparison with ε-greedy (ε = 0.1, α = 1/k) and ε-greedy (ε = 0.1, α = 0.1).]

Pursuit Methods
Maintain both action-value estimates and action preferences.
Always "pursue" the greedy action, i.e., make the greedy action more likely to be selected.
After the t-th play, update the action values to get Q_{t+1}.
The new greedy action is a*_{t+1} = argmax_a Q_{t+1}(a).
Then:
    π_{t+1}(a*_{t+1}) = π_t(a*_{t+1}) + β [1 − π_t(a*_{t+1})]
and the probs. of the other actions are decremented to maintain the sum of 1.
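
A sketch of one pursuit update (names are illustrative; decrementing each non-greedy probability by β π_t(a) is one standard way to keep the sum at 1):

```python
import numpy as np

def pursuit_step(pi, Q, N, a, r, beta):
    """Sample-average update of Q(a), then move pi toward the new greedy action."""
    pi, Q, N = pi.copy(), Q.copy(), N.copy()
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                   # action-value update
    greedy = int(np.argmax(Q))                  # new greedy action
    pi[greedy] += beta * (1.0 - pi[greedy])     # pursue it
    others = np.arange(len(pi)) != greedy
    pi[others] *= (1.0 - beta)                  # decrement the rest; sum stays 1
    return pi, Q, N
```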

Performance of Pursuit Method
[Figure: % optimal action vs. plays (0 to 1000), comparing pursuit, reinforcement comparison, and ε-greedy (ε = 0.1, α = 1/k).]

Associative Search
Imagine switching bandits at each play.
[Figure: a bandit with 3 actions; the bandit faced changes from play to play.]

Conclusions
These are all very simple methods, but they are complicated enough; we will build on them.
Ideas for improvements:
- estimating uncertainties... interval estimation
- approximating Bayes optimal solutions
- Gittins indices
The full RL problem offers some ideas for solution...