Reinforcement Learning and Policy Reuse

Manuela M. Veloso, PEL Fall 2016

Reading: Reinforcement Learning: An Introduction, R. Sutton and A. Barto. Probabilistic policy reuse in a reinforcement learning agent, Fernando Fernandez and Manuela Veloso, in Proceedings of AAMAS 06. (Thanks to Fernando Fernandez.)

Learning
Learning from experience. Supervised learning: labeled examples. Reward/reinforcement: something good or bad (a positive or negative reward) happens. An agent gets a reward as part of the input percept, but it is programmed to understand it as a reward. Reinforcement has been extensively studied by animal psychologists.

Reinforcement Learning
The problem of getting an agent to act in the world so as to maximize its rewards. Teaching a dog a new trick: you cannot tell it what to do, but you can reward or punish it if it does the right or wrong thing. Learning is figuring out what it did that made it get the reward or punishment: the credit assignment problem. RL: similar methods to train computers to do many tasks.

Reinforcement Learning Task
Assume the world is a Markov Decision Process. States and actions are known; transitions and rewards are unknown. Full observability. Objective: learn an action policy π : S → A that maximizes the expected reward E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...] from any starting state in S, where 0 ≤ γ < 1 is a discount factor for future rewards.

Reinforcement Learning Problem
The agent sees the state, selects an action, and gets a reward. Goal: learn to choose actions that maximize r_0 + γ r_1 + γ^2 r_2 + ..., where 0 ≤ γ < 1.

Online Learning Approaches
Capabilities: execute actions in the world; observe the state of the world. Two learning approaches: model-based and model-free.

Model-Based Reinforcement Learning
Approach: learn the MDP, then solve the MDP to determine the optimal policy. Appropriate when the model is unknown, but small enough to solve feasibly.

Learning the MDP: estimate the rewards and the transition distribution. Try every action some number of times and keep counts (frequentist approach): R(s,a) = R_{s,a} / N_{s,a} and T(s,a,s') = N_{s,a,s'} / N_{s,a}, where R_{s,a} is the summed observed reward. Solve using value or policy iteration.

Iterative Learning and Action: maintain the statistics incrementally; solve the model periodically.
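As a concrete sketch of this approach, the code below keeps the frequentist counts described above and periodically solves the estimated MDP with value iteration. The container and function names are illustrative, not from the course.

```python
from collections import defaultdict

# Hypothetical count containers, updated after every observed (s, a, r, s') step.
N_sa = defaultdict(int)        # times action a was tried in state s
N_sas = defaultdict(int)       # times that try landed in s'
R_total = defaultdict(float)   # summed reward observed for (s, a)

def record(s, a, r, s_next):
    """Maintain the statistics incrementally (frequentist estimates)."""
    N_sa[(s, a)] += 1
    N_sas[(s, a, s_next)] += 1
    R_total[(s, a)] += r

def estimated_model(states, actions):
    """R_hat(s,a) = R_total / N_sa and T_hat(s,a,s') = N_sas / N_sa."""
    R_hat = {(s, a): R_total[(s, a)] / max(N_sa[(s, a)], 1)
             for s in states for a in actions}
    T_hat = {(s, a, s2): N_sas[(s, a, s2)] / max(N_sa[(s, a)], 1)
             for s in states for a in actions for s2 in states}
    return R_hat, T_hat

def value_iteration(states, actions, R_hat, T_hat, gamma=0.9, iters=100):
    """Solve the estimated MDP periodically; returns V and the greedy policy."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(R_hat[(s, a)]
                    + gamma * sum(T_hat[(s, a, s2)] * V[s2] for s2 in states)
                    for a in actions)
             for s in states}
    policy = {s: max(actions,
                     key=lambda a: R_hat[(s, a)]
                     + gamma * sum(T_hat[(s, a, s2)] * V[s2] for s2 in states))
              for s in states}
    return V, policy
```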

Model-Free Reinforcement Learning
Learn the policy mapping directly. Appropriate when the model is too large to store, solve, or learn. Does not need to try every state/action in order to get a good policy. Converges to the optimal policy.

Value Function
For each possible policy π, define an evaluation function over states:

V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... = Σ_{i=0..∞} γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s. Then π* ≡ argmax_π V^π(s), for all s.

Learning task: learn the OPTIMAL policy.
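The definition of V^π can be checked numerically: compute the discounted return of one episode, and estimate V^π(s) as the average return of rollouts that follow π from s. A minimal sketch, assuming a hypothetical simulator with reset_to/step methods (not part of the lecture):

```python
def discounted_return(rewards, gamma):
    """r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for one episode."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def mc_value(env, policy, s, gamma=0.9, episodes=1000, horizon=100):
    """Estimate V^pi(s) by averaging discounted returns of rollouts from s."""
    total = 0.0
    for _ in range(episodes):
        state, rewards = env.reset_to(s), []   # hypothetical simulator API
        for _ in range(horizon):
            state, r, done = env.step(policy[state])
            rewards.append(r)
            if done:
                break
        total += discounted_return(rewards, gamma)
    return total / episodes
```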

Learn a Value Function
Learn the evaluation function V^{π*} (i.e., V*). Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:

π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

But the reward and transition functions are unknown.

Q Function
Define a new function very similar to V*:

Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))

Learn the Q function: Q-learning. If the agent learns Q, it can choose the optimal action even without knowing δ or r:

π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ] = argmax_a Q(s,a)
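The contrast on this slide is exactly what a greedy one-step lookahead needs: with only V*, the agent must know r and δ; with Q, it does not. A small sketch with assumed dictionary/function arguments:

```python
def greedy_from_v(V, s, actions, r, delta, gamma):
    """pi*(s) via one-step lookahead: needs the reward function r and the
    transition function delta, in addition to V*."""
    return max(actions, key=lambda a: r(s, a) + gamma * V[delta(s, a)])

def greedy_from_q(Q, s, actions):
    """pi*(s) = argmax_a Q(s,a): needs neither r nor delta."""
    return max(actions, key=lambda a: Q[(s, a)])
```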

Q-Learning
Q and V*: V*(s) = max_{a'} Q(s,a'). We can write Q recursively:

Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Q-learning actively generates examples. It processes examples by updating its Q values. While learning, the Q values are approximations.

Training Rule to Learn Q (Deterministic Example)
Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:

Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')

where s' is the state resulting from applying action a in state s, and r is the reward that is returned.
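A minimal tabular sketch of that deterministic training rule. The env simulator interface and the purely random behavior policy are assumptions; any exploring behavior policy would do.

```python
from collections import defaultdict
import random

def q_learning_deterministic(env, actions, gamma=0.9, episodes=500, horizon=100):
    """Q_hat(s,a) <- r + gamma * max_a' Q_hat(s',a'), applied along experienced steps."""
    Q = defaultdict(float)                     # Q_hat, initially 0 everywhere
    for _ in range(episodes):
        s = env.reset()                        # hypothetical simulator API
        for _ in range(horizon):
            a = random.choice(actions)         # some exploring behavior policy
            s2, r, done = env.step(a)
            Q[(s, a)] = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            s = s2
            if done:
                break
    return Q
```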

Deterministic Case Example
Applying the training rule to one transition of the grid-world example (taking the action "right" from state s1 into state s2):

Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a') = 0 + 0.9 · max{63, 81, 100} = 90

[Figure: grid world with the reward and current Q̂ values]

Q-Learning Iterations
Start at the top-left corner, with a fixed clockwise policy. Initially Q(s,a) = 0; γ = 0.8. Apply the training rule

Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')

along the clockwise loop, updating Q(1,E), Q(2,E), Q(3,S), and Q(4,W) in turn over successive passes.

[Figures: grid world showing the rewards and the Q̂ values after each pass]

Nondeterministic Case
Q-learning in a nondeterministic world: redefine V and Q by taking expected values:

V^π(s) ≡ E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ] = E[ Σ_{i=0..∞} γ^i r_{t+i} ]
Q(s,a) ≡ E[ r(s,a) + γ V*(δ(s,a)) ]

Q-learning training rule:

Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s',a') ]

where α_n = 1 / (1 + visits_n(s,a)) and s' = δ(s,a). Q̂ still converges to Q* (Watkins and Dayan, 1992).
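The same loop as before, but with the nondeterministic rule and the visit-count learning rate α_n = 1/(1 + visits_n(s,a)); a sketch under the same assumed env interface:

```python
from collections import defaultdict
import random

def q_learning_stochastic(env, actions, gamma=0.9, episodes=2000, horizon=100):
    """Q_n <- (1 - alpha_n) Q_{n-1} + alpha_n [r + gamma * max_a' Q_{n-1}(s', a')]."""
    Q = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(episodes):
        s = env.reset()                               # hypothetical simulator API
        for _ in range(horizon):
            a = random.choice(actions)
            s2, r, done = env.step(a)
            visits[(s, a)] += 1
            alpha = 1.0 / (1.0 + visits[(s, a)])      # decaying learning rate
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
            if done:
                break
    return Q
```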

Exploration vs. Exploitation
There is a tension between learning the optimal strategy and using what you know so far to maximize expected reward. The convergence theorems depend on visiting each state a sufficient number of times, and we typically use reinforcement learning while performing the task. Exploration policies: a wacky approach acts randomly, in hopes of eventually exploring the entire environment; a greedy approach acts to maximize utility using the current estimates; a balanced approach acts more wacky when the agent has little knowledge of the environment and more greedy once the agent has acted in the environment longer. One-armed bandit problems.

Exploration Strategies
ε-greedy: exploit with probability 1−ε; choose among the remaining actions uniformly; adjust ε as learning continues.
Boltzmann: choose action a with probability

p(a) = e^{Q(s,a)/t} / Σ_{a'} e^{Q(s,a')/t}

where the temperature t cools over time (as in simulated annealing). All methods are sensitive to parameter choices and changes.
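Both selection rules in a short sketch, assuming a tabular Q stored as a dict keyed by (state, action):

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps):
    """Exploit argmax_a Q(s,a) with probability 1-eps; otherwise choose
    uniformly among the remaining actions."""
    best = max(actions, key=lambda a: Q[(s, a)])
    others = [a for a in actions if a != best]
    if others and random.random() < eps:
        return random.choice(others)
    return best

def boltzmann(Q, s, actions, temperature):
    """p(a) proportional to exp(Q(s,a)/t); t is cooled over time."""
    prefs = [math.exp(Q[(s, a)] / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]
```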

Policy Reuse
What is the impact of a change of the reward function? We do not want to learn from scratch. Transfer-learning options: learn macros of the MDP (options), value-function transfer, exploration bias, or reuse complete policies.

Episodes
An MDP with absorbing goal states: the transition probability from a goal state to the same goal state is 1 (and therefore to any other state it is 0). An episode starts in a random state and ends in an absorbing state. Reward per episode (K episodes of H steps each): the average discounted reward over the episodes.
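A small helper for that per-episode measure, read here as the average discounted reward W = (1/K) Σ_k Σ_h γ^h r_{k,h} over the K episodes; the exact formula did not survive the transcript, so this reading is an assumption.

```python
def average_gain(episode_rewards, gamma):
    """W = (1/K) * sum_k sum_h gamma^h * r_{k,h}, for K lists of episode rewards."""
    K = len(episode_rewards)
    return sum(sum(gamma**h * r for h, r in enumerate(rewards))
               for rewards in episode_rewards) / K
```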

Domains and Tasks
Policy Library and Reuse
π-reuse Exploration

[Figures: grid navigation domain and tasks, the policy library, and the π-reuse exploration strategy]
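A sketch of the π-reuse exploration strategy, roughly following Fernandez and Veloso's description: with probability ψ take the past policy's action, otherwise act ε-greedily on the current Q, and decay ψ within the episode. The env interface, parameter defaults, and update details here are assumptions.

```python
import random

def pi_reuse_episode(env, Q, past_policy, actions, gamma=0.95,
                     psi=1.0, decay=0.95, eps=0.1, alpha=0.1, horizon=100):
    """One episode of pi-reuse exploration: follow the past policy with
    probability psi (decayed each step), otherwise act eps-greedily on the
    current Q estimate; Q is updated along the way."""
    s, gain, discount = env.reset(), 0.0, 1.0      # hypothetical simulator API
    for _ in range(horizon):
        if random.random() < psi:                  # bias from the reused policy
            a = past_policy[s]
        elif random.random() < eps:                # otherwise eps-greedy on Q
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s2, r, done = env.step(a)
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        gain += discount * r
        discount *= gamma
        psi *= decay                               # trust the old policy less and less
        s = s2
        if done:
            break
    return gain                                    # feeds the W_i estimate for this policy
```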

π-reuse Policy Learning
Experimental Results

[Figures: the π-reuse policy-learning algorithm and experimental result plots]

Results: Policy Reuse in Q-Learning
Interestingly, the π-reuse strategy also contributes a similarity metric between policies. The gain W_i is obtained while executing the π-reuse exploration strategy reusing the past policy Π_i. W_i is an estimate of how similar the policy Π_i is to the new one! The set of W_i values for the policies in the library is unknown a priori, but it can be estimated on-line while the new policy is computed over the different episodes.
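One way to read this on-line estimation, sketched in the spirit of the paper's PRQ-learning with assumed names: keep a running average W_i of the gain obtained while reusing each library policy, and draw the policy to reuse next with a softmax over those estimates.

```python
import math
import random

def choose_policy(W, tau):
    """Softmax over the current gain estimates W_i: policies with higher
    estimated gain are reused more often; tau may be increased over episodes
    to shift from exploring the library to exploiting the best policy."""
    keys = list(W)
    prefs = [math.exp(tau * W[k]) for k in keys]
    total = sum(prefs)
    return random.choices(keys, weights=[p / total for p in prefs])[0]

def update_gain(W, counts, i, episode_gain):
    """Running average of the gain obtained while reusing library policy i."""
    counts[i] += 1
    W[i] += (episode_gain - W[i]) / counts[i]
```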

Learning to Use a Policy Library
The similarity between policies can be learned from the gain of using each policy. Explore different policies. Learn the domain structure: eigen-policies.

Summary
Reinforcement learning, Q-learning, and Policy Reuse. Next class: other reinforcement learning algorithms (there are many!).