Reinforcement Learning

Reinforcement Learning. Tom Mitchell, Machine Learning, chapter 13. Slides copyright Facundo Bromberg.

Outline: Introduction; Comparison with inductive learning; Markov Decision Processes: the model; Optimal policy: the task; Q Learning: the Q function, the algorithm, convergence proof, exploration vs. exploitation, non-deterministic rewards and actions.

Introduction. How an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goal. (Diagram: the Agent observes the State from the Environment and a Reward from the Critic, and emits an Action; the interaction generates the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...) Goal: learn to choose actions that maximize r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1.
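
To make the goal concrete, here is a minimal sketch (in Python, not from the slides) of computing the discounted return r_0 + γ r_1 + γ² r_2 + ... for a finite reward sequence; the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a finite reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Illustrative rewards only: three zero-reward steps, then a terminal reward of 100.
print(discounted_return([0, 0, 0, 100]))  # 0 + 0 + 0 + 0.9**3 * 100 = 72.9
```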

Introduction. This generic problem is one of learning to control sequential processes, such as: learning to control a mobile robot, learning to optimize operations in factories, and learning to play board games (world-class backgammon player, Tesauro 1995). It is suitable for scenarios with unpredictable (e.g. highly dynamic) or complex environments, where the lack of a domain theory makes it impossible to build in optimal behavior (as in planning, optimization algorithms, etc.).

Introduction. Assumption: goals can be defined by a reward function that assigns numerical values to state-action pairs. This reward function is known by the critic, which could be external or built into the agent. The task of the agent is to perform sequences of actions, observe their consequences, and learn a control policy π: S → A that chooses actions maximizing the accumulated reward.

Differences with inductive learning. Differences with function approximation (inductive learning) of π: S → A:
Delayed reward. Instead of training pairs <s, π(s)> giving the optimal action for the current state s, the agent receives a sequence of rewards and faces the problem of temporal credit assignment.
Exploration. The agent influences the distribution of training examples through the action sequence it chooses. This raises the problem of exploration vs. exploitation.

Differences with inductive learning (cont.).
Partially observable states. In many practical situations sensors provide only partial information about the environment's state. The optimal policy may therefore specifically include actions that improve the observability of the environment.
Life-long learning. Unlike an isolated inductive learning task, agent learning often requires learning several related tasks within the same environment. Prior knowledge or experience becomes relevant.

The model (or the task of making compromises). Deterministic or non-deterministic actions? Prior knowledge or not about the effects of its actions on the environment (domain theory)? Does a trainer give examples of optimal action sequences (inductive learning), or must the agent train itself? The choice here: the Markov Decision Process (MDP).

Markov Decision Processes. An MDP is a tuple (S, A, s_0, δ, r), where S is the set of states, A is the set of actions available to the agent, s_0 is the initial state, δ: S × A → S is the transition function, and r: S × A → ℝ⁺ is the reward function. r and δ depend only on the current state and action (the Markov property), and they are deterministic.
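
A deterministic MDP as defined above fits in a tiny data structure. The sketch below is only one possible encoding; the two-cell world and the names in it are assumptions for illustration, not part of the slides.

```python
from typing import Callable, NamedTuple

class MDP(NamedTuple):
    states: frozenset                       # S
    actions: frozenset                      # A
    s0: str                                 # initial state
    delta: Callable[[str, str], str]        # deterministic transition function delta(s, a)
    reward: Callable[[str, str], float]     # reward function r(s, a)

# Hypothetical two-cell world: moving 'right' from 'A' reaches the goal 'G' with reward 100.
toy = MDP(
    states=frozenset({"A", "G"}),
    actions=frozenset({"right", "stay"}),
    s0="A",
    delta=lambda s, a: "G" if (s == "A" and a == "right") else s,
    reward=lambda s, a: 100.0 if (s == "A" and a == "right") else 0.0,
)
```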

The task. The task of the agent is to learn a policy π: S → A that selects the next action a_t based on the current observed state s_t, i.e. π(s_t) = a_t. How? By finding a policy that maximizes cumulative reward over time, that is, a policy that maximizes

V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0}^∞ γ^i r_{t+i},   0 ≤ γ < 1,

where the sequence of rewards is generated by following π: a_0 = π(s_0) yields reward r_0 and state s_1 = δ(s_0, a_0), then a_1 = π(s_1) yields r_1 and s_2 = δ(s_1, a_1), and so on.

The task (2). Alternative definitions of total reward:
Finite horizon reward: Σ_{i=0}^{h} r_{t+i}.
Average reward: lim_{h→∞} (1/h) Σ_{i=0}^{h} r_{t+i}.

The optimal policy. The task of the agent is thus to learn the optimal policy, given by π* = argmax_π V^π(s), for all s. We denote by V*(s) = V^{π*}(s) the maximum discounted cumulative reward the agent can obtain starting at s.
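
The definition of V^π can be approximated directly by following a policy and summing discounted rewards. A minimal sketch, assuming δ and r are known and encoded as dictionaries; the three-state chain is invented for illustration (it mirrors the kind of grid-world path used in the slides' examples).

```python
def evaluate_policy(s, policy, delta, reward, gamma=0.9, horizon=50):
    """Approximate V^pi(s) by following pi for `horizon` steps and discounting rewards."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += discount * reward[(s, a)]
        discount *= gamma
        s = delta[(s, a)]
    return total

# Hypothetical chain s0 -> s1 -> s2 -> G, reward 100 on the last move; G is absorbing.
delta  = {("s0", "right"): "s1", ("s1", "right"): "s2",
          ("s2", "right"): "G",  ("G", "stay"): "G"}
reward = {("s0", "right"): 0, ("s1", "right"): 0,
          ("s2", "right"): 100, ("G", "stay"): 0}
pi = {"s0": "right", "s1": "right", "s2": "right", "G": "stay"}
print(evaluate_policy("s0", pi, delta, reward))  # 0 + 0 + 0.9**2 * 100 = 81.0
```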

Example. S: the cells of a grid world; A: the arrows (moves between adjacent cells); r: the numbers by the arrows (100 for the action entering the absorbing goal state G, 0 otherwise); γ = 0.9.
V*(s_bottom-right) = 100
V*(s_bottom-center) = 0 + 0.9·100 = 90
V*(s_bottom-left) = 0 + 0.9·0 + 0.9²·100 = 81
Since G is absorbing, the infinite sum becomes finite.

Q Learning. The agent wants to maximize cumulative reward, thus it should prefer state s_1 over s_2 whenever V*(s_1) > V*(s_2). However, the agent's policy must choose among actions, not states. No problem: the optimal action in state s is the action that maximizes the sum of the immediate reward r(s,a) plus the value V* of the immediate successor state δ(s,a), discounted by γ:

π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ],

where r(s,a) is the immediate reward and V*(δ(s,a)) is the value of the successor.

Q Learning (cont.). Thus, if the agent knows the functions r and δ, it can learn the optimal policy by learning V* offline with the value iteration algorithm (skipped here). But for all but a few cases δ is unknown: this requires precise knowledge of the domain, and sometimes the domain is even non-deterministic.
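
The slides mention, but skip, computing V* offline by value iteration when r and δ are known. Below is a minimal sketch of that idea under the same deterministic-MDP assumptions; the chain s0 → s1 → s2 → G is again an illustrative stand-in for the slides' grid, not their exact example.

```python
def value_iteration(states, actions, delta, reward, gamma=0.9, iters=100):
    """Compute V*(s) for a known deterministic MDP by repeated Bellman backups."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(reward[(s, a)] + gamma * V[delta[(s, a)]]
                    for a in actions if (s, a) in delta)
             for s in states}
    return V

# Same hypothetical chain as before: s0 -> s1 -> s2 -> G (absorbing).
states = ["s0", "s1", "s2", "G"]
actions = ["right", "stay"]
delta  = {("s0", "right"): "s1", ("s1", "right"): "s2",
          ("s2", "right"): "G",  ("G", "stay"): "G"}
reward = {("s0", "right"): 0, ("s1", "right"): 0,
          ("s2", "right"): 100, ("G", "stay"): 0}
print(value_iteration(states, actions, delta, reward))
# roughly {'s0': 81.0, 's1': 90.0, 's2': 100.0, 'G': 0.0}
```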

The Q function. If r or δ are unknown, what evaluation function should the agent use? The evaluation function Q, defined by Q(s,a) = r(s,a) + γ V*(δ(s,a)), so that

π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ] = argmax_a Q(s,a).

Thus, if the agent is capable of learning Q, it will be able to select optimal actions even when it has no knowledge of r and δ.

The Q function (cont.). π*(s) = argmax_a Q(s,a), with Q(s,a) = r(s,a) + γ V*(δ(s,a)). Surprisingly, the agent can choose optimal actions without ever conducting a lookahead search to explicitly consider what states result from each action. Yet the Q function has exactly that property: Q(s,a) summarizes in a single value all the information needed to determine the discounted cumulative reward that will be gained in the future if action a is chosen in state s.

Example. Q(s,a) = r(s,a) + γ V*(δ(s,a)). S: cells; A: arrows; r: the numbers by the arrows; γ = 0.9. For the actions moving right along the bottom row toward G:
Q(s_G, a) = 0 + 0.9·0 + ... = 0
Q(s_bottom-right, a_right) = 100 + 0.9·0 = 100
Q(s_bottom-center, a_right) = 0 + 0.9·100 = 90
Q(s_bottom-left, a_right) = 0 + 0.9·0 + 0.9²·100 = 81
Since G is absorbing, the infinite sum becomes finite.

Algorithm for learning Q (Watkins 1989). Since π*(s) = argmax_a Q(s,a) and Q(s,a) = r(s,a) + γ V*(δ(s,a)), learning Q is equivalent to learning the optimal policy. Note that

V*(s) = max_{a'} [ r(s,a') + γ V*(δ(s,a')) ] = max_{a'} Q(s,a'),

so we obtain a recursive definition of Q:

Q(s,a) = r(s,a) + γ max_{a'} Q(δ(s,a), a').

The algorithm learns an approximation Q̂ of Q, represented as a table with a separate entry for each state-action pair.

Algorithm for learning Q (cont.).
For each (s,a), initialize the table entry Q̂(s,a) to zero.
Observe the current state s.
Do forever:
- Choose some action a and execute it.
- Observe the resulting reward r = r(s,a) and the new state s' = δ(s,a).
- Update the table entry Q̂(s,a) according to the rule: Q̂(s,a) ← r + γ max_{a'} Q̂(s',a').
- s ← s'.
Note that Q-learning propagates Q̂ estimates backwards from the new state s' to the old state s.

Example 2: Q learning. Consider a transition from state s_1 to state s_2 by action a_right with reward r = 0, where the current table entries for actions leaving s_2 are 63, 81, and 100, and Q̂(s_1, a_right) was previously 72. The update gives

Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a') = 0 + 0.9·max{63, 81, 100} = 90.
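
A minimal Python sketch of the tabular update rule above. The slides leave action selection unspecified, so uniform random choice is used here as a placeholder, and the dictionary-based environment (the same hypothetical chain as earlier) is an assumption for illustration.

```python
import random
from collections import defaultdict

def q_learning(delta, reward, actions, start, goal, episodes=200, gamma=0.9):
    """Tabular Q-learning for a deterministic MDP: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                                   # Q-hat, all entries start at zero
    for _ in range(episodes):
        s = start
        while s != goal:
            a = random.choice([a for a in actions if (s, a) in delta])  # placeholder exploration
            r, s_next = reward[(s, a)], delta[(s, a)]
            Q[(s, a)] = r + gamma * max((Q[(s_next, a2)] for a2 in actions
                                         if (s_next, a2) in delta), default=0.0)
            s = s_next
    return Q

# Hypothetical chain s0 -> s1 -> s2 -> G again; as in the slides' episodes, the estimates
# propagate backwards: Q(s2,right)=100, Q(s1,right)=90, Q(s0,right)=81.
delta  = {("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "right"): "G"}
reward = {("s0", "right"): 0, ("s1", "right"): 0, ("s2", "right"): 100}
print(dict(q_learning(delta, reward, ["right"], "s0", "G")))
```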

Example 3: Q learning. Proceeding in episodes from s_0 to G, always through s_1 and s_2 (path s_0 → s_1 → s_2 → G, reward 100 only on the final transition into G, γ = 0.9). Initially all Q̂ entries along the path are 0.
Episode 1: only the entry for the last transition is updated: Q̂(s_2, →G) = 100; the entries at s_0 and s_1 remain 0.
Episode 2: the estimate propagates one step back: Q̂(s_1, →s_2) = 0 + 0.9·100 = 90; Q̂(s_2, →G) = 100; Q̂(s_0, →s_1) = 0.

Episode 3: Q̂(s_0, →s_1) = 0 + 0.9·90 = 81; Q̂(s_1, →s_2) = 90; Q̂(s_2, →G) = 100.

Convergence of Q learning. Will the algorithm converge toward a Q̂ equal to the true Q function? Yes, in this deterministic setting: provided rewards are bounded and every state-action pair is visited infinitely often, Q̂ converges to Q.

Experimentation strategies in Q learning. The algorithm does not specify how actions are chosen! Exploitation: one possibility is for the agent at state s to choose the action that maximizes Q̂(s,a), thereby exploiting its current approximation. With this strategy the agent risks failing to explore other actions, in other states, that have even higher values but haven't been visited yet. Moreover, the convergence theorem requires state-action pairs to be visited infinitely often.

Experimentation strategies in Q learning (2). Exploration: a probabilistic approach that gives higher probabilities to actions with higher Q̂ values (a small code sketch of this rule appears after this slide group):

P(a_i | s) = k^{Q̂(s,a_i)} / Σ_j k^{Q̂(s,a_j)},

where k > 0 determines how strongly the selection favors actions with high Q̂ values. High k: exploit; low k: explore.

Nondeterministic rewards and actions. Noisy effectors, games with dice, etc. Here δ(s,a) first produces a distribution P(s' | s,a) over successor states and then draws an outcome at random from it; similarly for r. We assume these probabilities satisfy the Markov property. We retrace the line of argument that led to the deterministic algorithm, revising it where needed.
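
A minimal sketch of the probabilistic selection rule shown above, P(a_i | s) proportional to k^Q̂(s,a_i). The value of k and the example Q̂ values are illustrative only.

```python
import random

def soft_select(q_values, k=2.0):
    """Pick an action with probability proportional to k ** Q_hat(s, a); larger k exploits more."""
    actions = list(q_values)
    weights = [k ** q_values[a] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Illustrative Q-hat values for one state: 'right' is strongly favored when k is large.
q_hat = {"up": 1.0, "right": 3.0, "down": 0.5}
print(soft_select(q_hat, k=4.0))
```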

Nondeterministic value function. We define the value function V^π(s_t) for policy π as the expected value of the discounted cumulative reward:

V^π(s_t) ≡ E[ Σ_{i=0}^∞ γ^i r_{t+i} ].

As before, we define the optimal policy to be the policy that maximizes V^π(s) for all states s: π* = argmax_π V^π(s), for all s.

Nondeterministic value function (cont.). We generalize the earlier definition of Q by taking its expected value:

Q(s,a) ≡ E[ r(s,a) + γ V*(δ(s,a)) ] = E[r(s,a)] + γ E[V*(δ(s,a))] = E[r(s,a)] + γ Σ_{s'} P(s' | s,a) V*(s').

As before, we can express Q recursively:

Q(s,a) = E[r(s,a)] + γ Σ_{s'} P(s' | s,a) max_{a'} Q(s',a').

Convergence and training rule. The convergence proof holds for the deterministic case, but the previous learning rule does not converge in the nondeterministic case. The following training rule is sufficient to assure convergence of Q̂ to Q:

Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s',a') ],

where α_n = 1 / (1 + visits_n(s,a)) and the bracketed term is the deterministic training rule.
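
A minimal sketch of the nondeterministic training rule above, with α_n = 1/(1 + visits_n(s,a)). The stochastic reward used to drive it is a made-up illustration, not the slides' example.

```python
import random
from collections import defaultdict

def nd_q_update(Q, visits, s, a, r, s_next, next_actions, gamma=0.9):
    """One nondeterministic Q-learning update:
    Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = r + gamma * max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Illustration: a noisy transition from 's' under action 'a' that pays 100 or 0 at random.
Q, visits = defaultdict(float), defaultdict(int)
for _ in range(1000):
    r = 100 if random.random() < 0.5 else 0        # hypothetical noisy reward
    nd_q_update(Q, visits, "s", "a", r, "G", [])   # 'G' treated as terminal: no next actions
print(round(Q[("s", "a")], 1))  # approaches the expected reward, roughly 50
```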

Convergence and training rule (cont.). Key idea: revisions to Q̂ are made more gradually than in the deterministic case. With α_n = 1 we recover the deterministic learning rule. The choice of α_n given above is one of many that satisfy the conditions for convergence according to a theorem by Watkins and Dayan (1992) (see Mitchell; not included here).