CSE 190: Reinforcement Learning: An Introduction. Chapter 4: Dynamic Programming


Administrivia
CSE 190: Reinforcement Learning: An Introduction. Any email sent to me about the course should have "CSE 190" in the subject line!
Chapter 4: Dynamic Programming
Acknowledgment: A good number of these slides are cribbed from Rich Sutton.

Goals for this chapter
- Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
- Show how DP can be used to compute value functions, and hence, optimal policies
- Discuss the efficiency and utility of DP

Last Time: Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy $\pi$:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$:
$$Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$$

Last Time: Bellman Equation for Policy $\pi$
The basic idea:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\left(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots\right) = r_{t+1} + \gamma R_{t+1}$$
So:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$$
Or, without the expectation operator:
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$

Last Time: More on the Bellman Equation
This is a set of equations (in fact, linear), one for each state. The value function for $\pi$ is its unique solution. (A small numerical sketch of solving this linear system directly follows the Q* slide below.)
Backup diagrams: for $V^\pi$ and for $Q^\pi$.

Last Time: Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s,a) = \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\} = \max_{a \in A(s)} \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^*(s')\right]$$
The relevant backup diagram: [figure]
$V^*$ is the unique solution of this system of nonlinear equations.

Last Time: Bellman Optimality Equation for Q*
$$Q^*(s,a) = E\left\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t = s, a_t = a\right\} = \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma \max_{a'} Q^*(s',a')\right]$$
The relevant backup diagram: [figure]
$Q^*$ is the unique solution of this system of nonlinear equations.
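As the "More on the Bellman Equation" slide notes, the Bellman equation for $V^\pi$ is a linear system with a unique solution, so for a small MDP it can be solved directly rather than by iteration. A minimal sketch, using a made-up two-state MDP and variable names that are assumptions for illustration only:

```python
import numpy as np

# Sketch: solve V^pi = r^pi + gamma * P^pi V^pi exactly with linear algebra.
# The tiny two-state MDP below is purely illustrative, not from the slides.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])   # P^pi[s, s'] = sum_a pi(s, a) P^a_{ss'}
r_pi = np.array([1.0, 0.0])     # expected one-step reward in each state under pi

# (I - gamma P^pi) V = r^pi has a unique solution whenever gamma < 1.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V)
```

This is the "solve the |S| simultaneous linear equations" route; the iterative methods on the next slides avoid the matrix solve, which matters when |S| is large.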

This Time
- Policy evaluation: how to solve these equations using iteration
- We can solve for the optimal V*, but often it is faster to evaluate and improve the policy, alternating between figuring out $V^\pi$ and improving $\pi$

Policy Evaluation
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.
Recall the state-value function for policy $\pi$:
$$V^\pi(s) = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$$
and the Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$
a system of $|S|$ simultaneous linear equations.

Iterative Methods
$$V_0 \rightarrow V_1 \rightarrow \cdots \rightarrow V_k \rightarrow V_{k+1} \rightarrow \cdots \rightarrow V^\pi$$
Each arrow is a sweep. A sweep consists of applying a backup operation to each state.
A full policy-evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$$
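A minimal sketch of this full policy-evaluation backup as a sweep over all states. The array encoding (P[a, s, s'], R[a, s, s'], pi[s, a]) and the function name are assumptions for illustration, not notation from the slides:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation (sketch).

    Assumed encoding:
      P[a, s, s'] - transition probabilities
      R[a, s, s'] - expected immediate rewards
      pi[s, a]    - probability of taking action a in state s under the policy
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):            # one sweep: back up every state
            v = V[s]
            # V(s) <- sum_a pi(s,a) sum_s' P^a_ss' [ R^a_ss' + gamma V(s') ]
            V[s] = sum(pi[s, a] * np.sum(P[a, s] * (R[a, s] + gamma * V))
                       for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                     # stop when the sweep changed little
            return V
```

This sketch updates V in place during the sweep (reusing new values immediately), a common variant that also converges; the slides' equation describes the synchronous version, where $V_{k+1}$ is computed entirely from $V_k$.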

Iterative Policy Evaluation

A Small Gridworld
- An undiscounted episodic task
- Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as the shaded squares)
- Actions that would take the agent off the grid leave the state unchanged
- Reward is -1 on every step until the terminal state is reached

A Small Gridworld
Note here that the actions are deterministic, so this equation:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$$
becomes
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a)\left[R^a_{s} + \gamma V_k(s')\right],$$
where $s'$ is the state that action $a$ deterministically leads to. And since the task is undiscounted ($\gamma = 1$), this becomes
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a)\left[R^a_{s} + V_k(s')\right]$$

A Small Gridworld
Writing out the sum over the four actions explicitly:
$$V_{k+1}(s) \leftarrow \pi(s,\text{UP})\left[R^{\text{UP}} + V_k(s'_{\text{UP}})\right] + \pi(s,\text{RIGHT})\left[R^{\text{RIGHT}} + V_k(s'_{\text{RIGHT}})\right] + \pi(s,\text{DOWN})\left[R^{\text{DOWN}} + V_k(s'_{\text{DOWN}})\right] + \pi(s,\text{LEFT})\left[R^{\text{LEFT}} + V_k(s'_{\text{LEFT}})\right]$$
Under the equiprobable random policy, with every reward equal to $-1$:
$$V_{k+1}(s) \leftarrow 0.25\left[-1 + V_k(s'_{\text{UP}})\right] + 0.25\left[-1 + V_k(s'_{\text{RIGHT}})\right] + 0.25\left[-1 + V_k(s'_{\text{DOWN}})\right] + 0.25\left[-1 + V_k(s'_{\text{LEFT}})\right]$$

A Small Gridworld
For state 4, for example, we have:
$$V_{k+1}(4) \leftarrow 0.25\left[-1 + V_k(\text{terminal})\right] + 0.25\left[-1 + V_k(5)\right] + 0.25\left[-1 + V_k(8)\right] + 0.25\left[-1 + V_k(4)\right]$$

A Small Gridworld
Plugging in the values from the previous sweep (where $V_k(\text{terminal}) = 0$ and $V_k = -1$ for every nonterminal state):
$$V_{k+1}(4) \leftarrow 0.25\left[-1 + 0\right] + 0.25\left[-1 + (-1)\right] + 0.25\left[-1 + (-1)\right] + 0.25\left[-1 + (-1)\right] = -1.75$$

Iterative Policy Evaluation for the Small Gridworld
$\pi$ = equiprobable random action choices.
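A minimal sketch of this computation for the whole 4x4 gridworld under the equiprobable random policy. The row-major state numbering and helper names are assumptions, but the update is exactly the one worked out above, and after the second sweep it gives $V(4) = -1.75$:

```python
import numpy as np

# Sketch: policy evaluation on the 4x4 gridworld, equiprobable random policy.
# States 0..15, row-major; 0 and 15 stand for the single terminal state that
# the slides show twice as shaded squares. Every step earns reward -1, gamma = 1.
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]         # up, down, left, right

def step(s, a):
    """Deterministic successor; moves off the grid leave the state unchanged."""
    r, c = divmod(s, 4)
    r2, c2 = r + a[0], c + a[1]
    return s if not (0 <= r2 < 4 and 0 <= c2 < 4) else 4 * r2 + c2

V = np.zeros(16)
for sweep in range(1, 4):
    V_new = V.copy()                                  # synchronous backup, as on the slides
    for s in range(16):
        if s in TERMINAL:
            continue
        V_new[s] = sum(0.25 * (-1 + V[step(s, a)]) for a in ACTIONS)
    V = V_new
    print(f"after sweep {sweep}: V(4) = {V[4]:.2f}")  # sweep 1: -1.00, sweep 2: -1.75
```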

Iterative Policy Evaluation for the Small Gridworld
$\pi$ = equiprobable random action choices. But look what happens if these values are used to make a new policy! (Note: this won't always happen.)
Exercise for the reader: what are the values of the states under the optimal policy?

Policy Improvement
Suppose we have computed $V^\pi$ for a deterministic policy $\pi$. For a given state $s$, would it be better to do an action $a \neq \pi(s)$? The value of doing $a$ in state $s$ is:
$$Q^\pi(s,a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$
It is better to switch to action $a$ for state $s$ if and only if $Q^\pi(s,a) > V^\pi(s)$.

Policy Improvement (cont.)
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:
$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$
Then $V^{\pi'} \geq V^\pi$.
What if $V^{\pi'} = V^\pi$? That is, what if for all $s \in S$,
$$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]?$$
But this is the Bellman Optimality Equation. So $V^{\pi'} = V^*$, and both $\pi$ and $\pi'$ are optimal policies.
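A minimal sketch of this greedy policy-improvement step, using the same assumed P[a, s, s'], R[a, s, s'] encoding as the earlier policy-evaluation sketch:

```python
import numpy as np

def greedy_policy(P, R, V, gamma=0.9):
    """Policy improvement (sketch): a deterministic policy greedy w.r.t. V.

    Returns pi_new[s] = argmax_a sum_s' P^a_ss' [ R^a_ss' + gamma V(s') ],
    with the same assumed P[a, s, s'] / R[a, s, s'] encoding as before.
    """
    n_actions, n_states, _ = P.shape
    Q = np.array([[np.sum(P[a, s] * (R[a, s] + gamma * V))   # Q(s, a) backed up from V
                   for a in range(n_actions)]
                  for s in range(n_states)])
    return Q.argmax(axis=1)                                   # greedy action in each state
```

If the returned policy already equals the old one, its value function satisfies the Bellman optimality equation and the policy is optimal, which is exactly the argument on the slide.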

Policy Iteration
$$\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \cdots \rightarrow \pi^* \rightarrow V^* \rightarrow \pi^*$$
Each $\pi_i \rightarrow V^{\pi_i}$ step is policy evaluation; each $V^{\pi_i} \rightarrow \pi_{i+1}$ step is policy improvement ("greedification"). (A code sketch of this loop follows the Jack's Car Rental slide below.)

Jack's Car Rental
- $10 for each car rented (the car must be available when the request is received)
- Two locations, maximum of 20 cars at each
- Cars are returned and requested randomly: Poisson distribution, $n$ returns/requests with probability $\frac{\lambda^n e^{-\lambda}}{n!}$, where $\lambda$ is the expected number
  - 1st location: average requests = 3, average returns = 3
  - 2nd location: average requests = 4, average returns = 2
- Can move up to 5 cars between locations overnight, at $2 per car
- States, actions, rewards? Transition probabilities?
- Note this makes sense: location 2 on average loses 2 cars per day.
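A minimal sketch of the policy-iteration loop, alternating the policy_evaluation and greedy_policy sketches above until the policy stops changing (same assumed P, R encoding; not code from the slides):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration (sketch): evaluate, greedify, repeat until stable."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial deterministic policy
    while True:
        pi = np.eye(n_actions)[policy]               # one-hot pi[s, a] for the evaluator
        V = policy_evaluation(P, R, pi, gamma)       # policy evaluation
        new_policy = greedy_policy(P, R, V, gamma)   # policy improvement ("greedification")
        if np.array_equal(new_policy, policy):       # greedy w.r.t. its own value function
            return policy, V                         # => Bellman optimality holds => optimal
        policy = new_policy
```

For Jack's Car Rental the same loop applies once the states (cars at each location), actions (cars moved overnight), and Poisson-based transition probabilities are encoded into P and R.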

Jack's CR Exercise
- Suppose the first car moved is free, from the 1st to the 2nd location, because an employee travels that way anyway by bus
- Suppose only 10 cars can be parked for free at each location; more than 10 costs $4 for using an extra parking lot
- Such arbitrary nonlinearities are common in real problems

Policy Iteration: Can We Do Better?
Each iteration involves policy evaluation, which is itself an iterative process. From the previous example, it looks like policy evaluation may converge long after the greedy policy based on the values has converged. Can we skip steps somehow? Yes: policy evaluation can be stopped early, and in most cases convergence is still guaranteed! A very special case, stopping after one sweep of policy evaluation, is called value iteration.

Value Iteration
Recall the full policy-evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$$
Here is the full value-iteration backup:
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V_k(s')\right]$$

Value Iteration (cont.)
Note how this combines policy improvement and evaluation: it is simply the Bellman optimality equation turned into an update equation! In practice, the policy-evaluation (sum) backup is often performed several times between policy-improvement (max) sweeps.
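A minimal sketch of the value-iteration backup, again over the assumed P[a, s, s'], R[a, s, s'] encoding; the only change from policy evaluation is the max over actions in place of the policy-weighted sum:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration (sketch): the Bellman optimality equation as an update."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_s' P^a_ss' [ R^a_ss' + gamma V(s') ]
        Q = np.einsum('ast,ast->as', P, R + gamma * V)
        V_new = Q.max(axis=0)                  # V(s) <- max_a Q(a, s)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=0)     # optimal values and a greedy (optimal) policy
        V = V_new
```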

Gambler's Problem
- A gambler can repeatedly bet money on a coin flip: heads he wins his stake, tails he loses it
- Initial capital is in {$1, $2, ..., $99}
- The gambler wins if his capital reaches $100 and loses if it reaches $0
- The coin is unfair: heads (the gambler wins) comes up with probability p = 0.4
- States, actions, rewards? (A value-iteration sketch for this problem follows the Asynchronous DP slide below.)

Gambler's Problem Solution
[figure]

Herd Management
- You are a consultant to a farmer managing a herd of cows
- The herd consists of 5 kinds of cows: young, milking, breeding, old, sick
- The number of each kind is the state
- The number sold of each kind is the action
- Cows transition from one kind to another; young cows can be born

Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this: repeat, until a convergence criterion is met, picking a state at random and applying the appropriate backup. It still needs lots of computation, but it does not get locked into hopelessly long sweeps. Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
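A minimal sketch of value iteration applied to the gambler's problem as stated (p = 0.4, goal of $100). The undiscounted formulation with reward +1 only on reaching the goal is an assumption consistent with the standard treatment of this example:

```python
import numpy as np

# Sketch: value iteration for the gambler's problem (p = 0.4, goal = $100).
p_heads, goal = 0.4, 100
V = np.zeros(goal + 1)
V[goal] = 1.0                                    # reaching $100 is worth +1; $0 is worth 0

while True:
    delta = 0.0
    for s in range(1, goal):                     # capital levels $1 .. $99
        stakes = range(1, min(s, goal - s) + 1)  # can't bet more than you have or need
        best = max(p_heads * V[s + a] + (1 - p_heads) * V[s - a] for a in stakes)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-9:
        break

print(V[50])   # value of $50 = probability of eventually winning from there
```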

Generalized Policy Iteration
GPI: any interaction of policy evaluation and policy improvement, independent of their granularity.
A geometric metaphor for convergence of GPI: [figure]

Efficiency of DP
- Finding an optimal policy is polynomial in the number of states...
- BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality")
- In practice, classical DP can be applied to problems with a few million states
- Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation
- It is surprisingly easy to come up with MDPs for which DP methods are not practical

Summary
- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Full backups (to be contrasted later with sample backups)
- Asynchronous DP: a way to avoid exhaustive sweeps
- Generalized Policy Iteration (GPI)
- Bootstrapping: updating estimates based on other estimates

END