Reinforcement learning II

CS 1675 Introduction to Machine Learning
Lecture 26: Reinforcement learning II
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Reinforcement learning
Basics (diagram: input x goes to the Learner, the Learner produces an output/action, and a Critic returns a reinforcement r):
- The learner interacts with the environment
- It receives an input with information about the environment (e.g. from sensors)
- It makes actions that (may) affect the environment
- It receives a reinforcement signal that provides feedback on how well it performed
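To make the interaction loop concrete, here is a minimal Python sketch of the cycle described above. The three functions (observe, act, reinforce) and their behavior are hypothetical placeholders, not part of the lecture.

```python
import random

# Minimal sketch of the agent-environment-critic loop, under assumed interfaces:
# `observe()` returns the current input x, `act(x)` is the learner's (so far
# arbitrary) action choice, and `reinforce(x, a)` is the critic's feedback.

def observe():
    return random.randint(0, 2)             # input x describing the environment

def act(x):
    return random.choice(["a0", "a1"])      # learner's action (random placeholder)

def reinforce(x, a):
    return random.choice([-1, 1])           # reinforcement signal r

for step in range(5):
    x = observe()
    a = act(x)
    r = reinforce(x, a)
    print(f"step {step}: input={x}, action={a}, reward={r:+d}")
```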

Reinforcement learning
Objective: learn how to act in the environment in order to maximize the reinforcement signal.
- The selection of actions should depend on the input
- A policy $\pi: X \to A$ maps inputs to actions
- Goal: find the optimal policy $\pi^*: X \to A$ that gives the best expected reinforcements
- Example: learn how to play games (AlphaGo)

Gambling example
Game: 3 biased coins (coin 1, coin 2, coin 3)
- The coin to be tossed is selected randomly from the three coin options. The agent always sees which coin is going to be played next.
- The agent makes a bet on either head or tail with a wager of $1. If, after the coin toss, the outcome agrees with the bet, the agent wins $1; otherwise it loses $1.
RL model:
- Input: X, the coin chosen for the next toss
- Action: A, the choice of head or tail the agent bets on
- Reinforcements: {1, -1}
- A policy $\pi: X \to A$, e.g. $\pi$: coin 1 -> head, coin 2 -> tail, coin 3 -> head

Gambling example
RL model:
- Input: X, the coin chosen for the next toss
- Action: A, the choice of head or tail the agent bets on
- Reinforcements: {1, -1}
- A policy $\pi$: coin 1 -> head, coin 2 -> tail, coin 3 -> head

State, action, reward trajectories (example):
  step:    0       1       2       ...  k
  state:   coin 2  coin 1  coin 2  ...  coin 1
  action:  tail    head    tail    ...  head
  reward:  -1      1       1       ...  1

RL learning: objective functions
Objective: find a policy $\pi: X \to A$ (coin 1 -> ?, coin 2 -> ?, coin 3 -> ?) that maximizes some combination of future reinforcements (rewards) received over time.
Valuation models (quantify how good the policy is):
- Finite horizon model: $E\left(\sum_{t=0}^{T} r_t\right)$, with time horizon $T \ge 0$
- Infinite horizon discounted model: $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$, with discount factor $0 \le \gamma < 1$
- Average reward model: $\lim_{T \to \infty} \frac{1}{T}\, E\left(\sum_{t=0}^{T} r_t\right)$
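A short Python sketch of how these valuation models could be estimated by simulating the gambling example under the fixed policy above; the coin biases used here are invented for illustration.

```python
import random

# Simulate the coin game under the policy (coin 1 -> head, coin 2 -> tail,
# coin 3 -> head) and compute sample versions of the three valuation models.
# The bias of each coin (P_HEAD) is made up.

P_HEAD = {1: 0.8, 2: 0.3, 3: 0.6}            # assumed P(head) for each coin
policy = {1: "head", 2: "tail", 3: "head"}   # the example policy from the slide

def play_one_step():
    coin = random.choice([1, 2, 3])                          # input x
    outcome = "head" if random.random() < P_HEAD[coin] else "tail"
    return 1 if policy[coin] == outcome else -1              # reward r

T = 1000
gamma = 0.9
rewards = [play_one_step() for _ in range(T)]

finite_horizon = sum(rewards)                                       # sum_{t<=T} r_t
discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))   # sum_t gamma^t r_t
average_reward = finite_horizon / T                                 # (1/T) sum_t r_t

print(finite_horizon, round(discounted, 2), round(average_reward, 3))
```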

RL with immediate rewards
Expected reward: $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$, with discount factor $0 \le \gamma < 1$.
Immediate reward case:
- The reward depends only on the input x and the action choice a
- The action does not affect the environment, and hence does not affect future inputs (states) and future rewards
- $R(x,a)$ denotes the expected one-step reward for input x (the coin to play next) and the choice a
The expected reward decomposes as
$$E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) = E(r_0) + \gamma E(r_1) + \gamma^2 E(r_2) + \dots$$
Optimal strategy $\pi^*: X \to A$:
$$\pi^*(x) = \arg\max_{a} R(x,a)$$
where $R(x,a)$ is the expected one-step reward defined above.

RL with immediate rewards
The optimal choice assumes we know the expected reward $R(x,a)$. Then:
$$\pi^*(x) = \arg\max_a R(x,a)$$
Caveats:
- We do not know the expected reward $R(x,a)$; we need to estimate it from interaction
- We cannot determine the optimal policy if the estimate of the expected reward is not good
- We also need to try actions that look suboptimal with respect to the current estimates $\tilde{R}(x,a)$

Estimating $\tilde{R}(x,a)$
Solution 1: for each input x try different actions a and estimate $R(x,a)$ using the average of observed rewards:
$$\tilde{R}(x,a) = \frac{1}{N_{x,a}} \sum_{i=1}^{N_{x,a}} r_i$$
Solution 2: online approximation; update the estimate after performing action a in x and observing the reward $r_i$:
$$\tilde{R}^{(i)}(x,a) \leftarrow (1 - \alpha(i))\,\tilde{R}^{(i-1)}(x,a) + \alpha(i)\, r_i$$
where $\alpha(i)$ is the learning rate.
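A minimal Python sketch of the two estimators for one fixed (input, action) pair; the stream of observed rewards is made up. With the learning rate α(i) = 1/i the online update reproduces the sample average.

```python
# Sketch of the two estimators of the expected immediate reward R(x, a)
# for a single (input, action) pair; in practice one estimate is kept per pair.

rewards = [1, -1, 1, 1, -1, 1]   # example observations (invented)

# Solution 1: sample average over all observed rewards.
r_avg = sum(rewards) / len(rewards)

# Solution 2: online approximation with learning rate alpha(i) = 1 / i,
# which incrementally reproduces the running average.
r_online = 0.0
for i, r in enumerate(rewards, start=1):
    alpha = 1.0 / i
    r_online = (1 - alpha) * r_online + alpha * r

print(r_avg, r_online)   # both give the same estimate
```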

RL with immediate rewards
At any step i during the experiment we have estimates of the expected rewards for each (coin, action) pair:
$\tilde{R}^{(i)}$(coin 1, head), $\tilde{R}^{(i)}$(coin 1, tail), $\tilde{R}^{(i)}$(coin 2, head), $\tilde{R}^{(i)}$(coin 2, tail), $\tilde{R}^{(i)}$(coin 3, head), $\tilde{R}^{(i)}$(coin 3, tail)
Assume the next coin to play in step (i+1) is coin 2 and we pick head as our bet. Then we update $\tilde{R}^{(i+1)}$(coin 2, head) using the observed reward and one of the update strategies above, and keep the reward estimates for the remaining (coin, action) pairs unchanged, e.g. $\tilde{R}^{(i+1)}$(coin 2, tail) = $\tilde{R}^{(i)}$(coin 2, tail).

Exploration vs. exploitation
Uniform exploration:
- Uses an exploration parameter $0 \le \epsilon \le 1$
- Choose the current best action $\hat{\pi}(x) = \arg\max_a \tilde{R}(x,a)$ with probability $1 - \epsilon$
- All other actions are selected with uniform probability $\epsilon / (|A| - 1)$
- Advantages: simple, easy to implement
- Disadvantages: exploration is more appropriate at the beginning, when we do not have good estimates of $\tilde{R}(x,a)$; exploitation is more appropriate later, when we have good estimates

Exploration vs. exploitation
Boltzmann exploration (both strategies are sketched in code after this slide):
- The action is chosen randomly, but proportionally to its current expected reward estimate
- Can be tuned with a temperature parameter T to promote exploration or exploitation
- Probability of choosing action a:
$$p(a \mid x) = \frac{\exp\left(\tilde{R}(x,a)/T\right)}{\sum_{a' \in A} \exp\left(\tilde{R}(x,a')/T\right)}$$
Effect of T:
- For high values of T, $p(a \mid x)$ is close to uniformly distributed over all actions
- For low values of T, $p(a \mid x)$ of the action with the highest value of $\tilde{R}(x,a)$ approaches 1

Agent navigation example
Agent navigation in a maze (grid with goal state G):
- 4 moves in compass directions
- Effects of moves are stochastic: we may wind up in a location other than the intended one with non-zero probability
- Objective: learn how to reach the goal state in the shortest expected time
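A sketch of the two exploration strategies in Python, assuming a small dictionary of current reward estimates for one input x; the numbers are invented.

```python
import math
import random

# Current reward estimates R~(x, a) for one input x (values are made up).
r_est = {"head": 0.4, "tail": -0.2}

def uniform_exploration(r_est, eps=0.1):
    """With prob. 1-eps pick the current best action, otherwise one of the rest."""
    best = max(r_est, key=r_est.get)
    if random.random() < 1 - eps:
        return best
    others = [a for a in r_est if a != best]
    return random.choice(others) if others else best

def boltzmann_exploration(r_est, temperature=1.0):
    """Sample an action with probability proportional to exp(R~(x,a)/T)."""
    weights = {a: math.exp(r / temperature) for a, r in r_est.items()}
    z = sum(weights.values())
    u, acc = random.random(), 0.0
    for a, w in weights.items():
        acc += w / z
        if u <= acc:
            return a
    return a   # numerical safety fallback

print(uniform_exploration(r_est), boltzmann_exploration(r_est, temperature=0.5))
```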

Agent navigation example
The RL model:
- Input: X, the position of the agent
- Output: A, the next move
- Reinforcements: R, -1 for each move and +100 for reaching the goal
- A policy $\pi: X \to A$, e.g. position 1 -> right, position 2 -> right, ..., position 25 -> left
- Goal: find the policy maximizing future expected rewards $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$, $0 \le \gamma < 1$

Agent navigation example
(The maze is a 5 x 5 grid with positions numbered 1 to 25 and a goal cell G; moves are in the four compass directions.)
State, action, reward trajectories under the policy above (example):
  step:    0      1      2      ...  k
  state:   pos 1  pos 2  pos 3  ...  pos 15
  action:  right  right  up     ...  up
  reward:  -1     -1     -1     ...  -1

Learning with delayed rewards
- Actions, in addition to yielding immediate rewards, affect the next state of the environment and thus indirectly also future rewards
- We need a model to represent environment changes and the effect of actions on states and the rewards associated with them
- Markov decision process (MDP): frequently used in AI, operations research, and control theory
(diagram: state at t-1 and action at t-1 lead to the state at t and the reward at t-1)

Markov decision process
Formal definition: a 4-tuple (S, A, T, R)
- A set of states S (inputs X), e.g. locations of a robot
- A set of actions A, e.g. move actions
- Transition model $T: S \times A \times S \to [0,1]$, e.g. where can I get to with different moves
- Reward model $R: S \times A \times S \to \mathbb{R}$, the reward/cost for a transition
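One possible way to make the 4-tuple concrete in code, sketched with plain Python dictionaries; the two-state model below is invented purely for illustration.

```python
# Sketch: representing a small MDP (S, A, T, R) with dictionaries.
# All states, actions, probabilities, and rewards below are made up.

S = ["s1", "s2"]            # set of states
A = ["left", "right"]       # set of actions

# Transition model T[s][a][s'] = P(s' | s, a); probabilities over s' sum to 1.
T = {
    "s1": {"left": {"s1": 0.9, "s2": 0.1}, "right": {"s1": 0.2, "s2": 0.8}},
    "s2": {"left": {"s1": 0.8, "s2": 0.2}, "right": {"s1": 0.1, "s2": 0.9}},
}

# Reward model R[s][a][s'] = reward/cost for the transition (s, a, s').
R = {
    "s1": {"left": {"s1": -1, "s2": -1}, "right": {"s1": -1, "s2": 10}},
    "s2": {"left": {"s1": -1, "s2": -1}, "right": {"s1": -1, "s2": 0}},
}

# Sanity check: each conditional distribution over next states sums to 1.
assert all(abs(sum(T[s][a].values()) - 1.0) < 1e-9 for s in S for a in A)
```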

MDP problem
We want to find the best policy $\pi^*: S \to A$.
The value function $V^\pi$ for a policy $\pi$ quantifies the goodness of the policy through, e.g., the infinite-horizon discounted model $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$. It:
1. combines future rewards over a trajectory
2. combines rewards for multiple trajectories (through expectation-based measures)

Value of a policy for an MDP
Assume a fixed policy $\pi: S \to A$. How do we compute the value of the policy under the infinite-horizon discounted model?
A fixed-point equation:
$$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, V^\pi(s')$$
The first term is the expected one-step reward for the first action; the second is the expected discounted reward for following the policy for the rest of the steps.
In vector form: $v = r + \gamma U v$, hence $v = (I - \gamma U)^{-1} r$.
For a finite state space we get a set of linear equations.
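A sketch of this exact policy-evaluation step with NumPy, solving the linear system rather than forming the inverse explicitly; the transition matrix and rewards are invented.

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma*U) v = r for a fixed policy pi
# on a tiny two-state MDP. U and r below are made up for illustration.

gamma = 0.9

# U[s, s'] = P(s' | s, pi(s)): transition matrix under the fixed policy.
U = np.array([[0.2, 0.8],
              [0.1, 0.9]])

# r[s] = expected one-step reward R(s, pi(s)) under the policy.
r = np.array([1.0, -1.0])

# np.linalg.solve is preferable to explicitly inverting (I - gamma*U).
v = np.linalg.solve(np.eye(2) - gamma * U, r)
print(v)   # V^pi(s) for each state
```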

Optimal policy
The value of the optimal policy:
$$V^*(s) = \max_{a \in A}\left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^*(s') \right]$$
(the expected one-step reward for the first action, plus the expected discounted reward for following the optimal policy for the rest of the steps).
The optimal policy $\pi^*: S \to A$:
$$\pi^*(s) = \arg\max_{a \in A}\left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^*(s') \right]$$

Computing the optimal policy
Dynamic programming, value iteration:
- computes the optimal value function first, then the policy
- an iterative approximation that converges to the optimal value function

Value iteration:
  initialize V                       ;; V is a vector of values for all states
  repeat
    set V' <- V
    for all s: set V(s) <- max_{a in A} [ R(s,a) + gamma * sum_{s' in S} P(s'|s,a) V'(s') ]
  until V' ≈ V (the change is below a small threshold)
  output pi(s) = argmax_{a in A} [ R(s,a) + gamma * sum_{s' in S} P(s'|s,a) V(s') ]
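A sketch of value iteration in NumPy on an invented two-state, two-action MDP; the stopping test compares successive value vectors as in the pseudocode above.

```python
import numpy as np

# Value iteration on a tiny made-up MDP; eps is the stopping tolerance.
gamma, eps = 0.9, 1e-6

# P[a, s, s'] = P(s' | s, a); R[a, s] = expected reward R(s, a). Numbers invented.
P = np.array([[[0.9, 0.1], [0.8, 0.2]],     # action 0
              [[0.2, 0.8], [0.1, 0.9]]])    # action 1
R = np.array([[-1.0, -1.0],                 # action 0
              [ 5.0,  0.0]])                # action 1

V = np.zeros(2)
while True:
    Q = R + gamma * (P @ V)        # Q[a, s] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=0)          # V(s) <- max_a Q(a, s)
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new

policy = Q.argmax(axis=0)          # greedy policy w.r.t. the converged values
print(V_new, policy)
```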

Reinforcement learning of optimal policies
In the RL framework we do not know the MDP model!
Goal: learn the optimal policy $\pi^*: S \to A$.
Two basic approaches:
- Model-based learning: learn the MDP model (probabilities, rewards) first, then solve the MDP
- Model-free learning: learn how to act directly; no need to learn the parameters of the MDP
A number of variants of the two exist in the literature.

Model-based learning
We need to learn the transition probabilities and rewards.
Learning of probabilities: maximum-likelihood parameter estimates, using counts:
$$\tilde{P}(s' \mid s, a) = \frac{N_{s,a,s'}}{N_{s,a}}$$
Learning of rewards: similar to learning with immediate rewards,
$$\tilde{R}(s,a) = \frac{1}{N_{s,a}} \sum_{i=1}^{N_{s,a}} r_i$$
or the online solution.
Problem: changes in the probability and reward estimates would require us to re-solve the MDP from scratch (after every action and reward seen)!
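A sketch of the count-based estimates, assuming a made-up list of experienced transitions (s, a, r, s'):

```python
from collections import defaultdict

# Model-based learning from experience tuples (s, a, r, s'): ML transition
# estimates from counts plus average rewards. The experience list is invented.

experience = [("s1", "right", -1, "s2"),
              ("s1", "right", -1, "s1"),
              ("s1", "right", -1, "s2"),
              ("s2", "left", 10, "s1")]

n_sas = defaultdict(int)     # N_{s,a,s'}
n_sa = defaultdict(int)      # N_{s,a}
r_sum = defaultdict(float)   # sum of rewards observed for (s, a)

for s, a, r, s_next in experience:
    n_sas[(s, a, s_next)] += 1
    n_sa[(s, a)] += 1
    r_sum[(s, a)] += r

P_hat = {k: n_sas[k] / n_sa[(k[0], k[1])] for k in n_sas}   # P~(s' | s, a)
R_hat = {k: r_sum[k] / n_sa[k] for k in n_sa}               # R~(s, a)
print(P_hat)
print(R_hat)
```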

Model-free learning
Motivation: the value function update (value iteration) is
$$V(s) \leftarrow \max_{a \in A}\left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V(s') \right]$$
Let
$$Q(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V(s')$$
Then $V(s) = \max_{a \in A} Q(s,a)$. Note that the update can be defined purely in terms of Q-functions:
$$Q(s,a) \leftarrow R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, \max_{a'} Q(s', a')$$

Q-learning
Q-learning uses the Q-value update idea, but relies on a stochastic (on-line, sample-by-sample) update: $R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$ is replaced with
$$\hat{Q}(s,a) \leftarrow (1 - \alpha)\,\hat{Q}(s,a) + \alpha\left[ r(s,a) + \gamma \max_{a'} \hat{Q}(s', a') \right]$$
where:
- $r(s,a)$ is the reward received from the environment after performing action a in state s
- $s'$ is the new state reached after action a
- $\alpha$ is the learning rate, a function of $N_{s,a}$, the number of times a has been executed at s

Q-function updates in Q-learning
At any step i during the experiment we have estimates of the Q-function for each (state, action) pair: $\hat{Q}$(position 1, up), $\hat{Q}$(position 1, left), $\hat{Q}$(position 1, right), $\hat{Q}$(position 1, down), $\hat{Q}$(position 2, up), ...
Assume the current state is position 1 and we pick the up action to be performed next. After we observe the reward, we update $\hat{Q}$(position 1, up) and keep the Q-function estimates for the remaining (state, action) pairs unchanged.

Q-learning
The on-line update rule is applied repeatedly during direct interaction with the environment:

Q-learning
  initialize Q(s,a) = 0 for all (s,a) pairs
  observe current state s
  repeat
    select action a             ;; use some exploration/exploitation schedule
    receive reward r
    observe next state s'
    update Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_{a'} Q(s', a') ]
    set s to s'
  end repeat
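A runnable sketch of tabular Q-learning with uniform exploration; the corridor environment, reward values, and hyperparameters are invented just to exercise the update rule from the pseudocode.

```python
import random
from collections import defaultdict

# Tabular Q-learning on a toy 1-D corridor: states 0..4, actions left/right,
# +10 for reaching state 4, -1 per move. The environment is made up.

ACTIONS = ["left", "right"]
GOAL, START = 4, 0
gamma, alpha, eps = 0.9, 0.1, 0.1

def step(s, a):
    s_next = max(0, s - 1) if a == "left" else min(GOAL, s + 1)
    r = 10 if s_next == GOAL else -1
    return r, s_next

Q = defaultdict(float)                              # Q[(s, a)], initialized to 0

def select_action(s):
    if random.random() < eps:
        return random.choice(ACTIONS)               # explore
    return max(ACTIONS, key=lambda a: Q[(s, a)])    # exploit current estimates

for episode in range(500):
    s = START
    while s != GOAL:
        a = select_action(s)
        r, s_next = step(s, a)
        target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next

# Greedy policy learned for each non-goal state (should be "right" everywhere).
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})
```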

Q-learning convergence
Q-learning is guaranteed to converge to the optimal Q-values under the following conditions:
- Every state is visited and every action in that state is tried an infinite number of times; this is assured via the exploration/exploitation schedule
- The sequence of learning rates for each Q(s,a) satisfies:
$$1.\ \sum_{i=1}^{\infty} \alpha(i) = \infty \qquad\qquad 2.\ \sum_{i=1}^{\infty} \alpha(i)^2 < \infty$$
where $\alpha(n(s,a))$ is the learning rate for the n-th trial of (s,a).

RL with delayed rewards
The optimal choice is
$$\pi^*(s) = \arg\max_a Q(s,a)$$
much like what we had for the immediate rewards: $\pi^*(x) = \arg\max_a R(x,a)$.
In RL, instead of exact values of $Q(s,a)$ we use estimates $\hat{Q}(s,a)$ learned via
$$\hat{Q}(s,a) \leftarrow (1 - \alpha)\,\hat{Q}(s,a) + \alpha\left[ r(s,a) + \gamma \max_{a'} \hat{Q}(s', a') \right]$$
Since we have only estimates of $Q(s,a)$, we also need to try actions that look suboptimal with respect to the current estimates.
Exploration/exploitation strategies: uniform exploration, Boltzmann exploration.
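As an added check of the learning-rate conditions stated above (not from the slides): the common schedule $\alpha(n) = 1/n$, which also appeared in the online averaging update, satisfies both requirements, since
$$\sum_{n=1}^{\infty} \alpha(n) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \alpha(n)^2 = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty.$$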

Q-learning speed-ups
The basic Q-learning rule may propagate distant (delayed) rewards very slowly.
Example (maze with goal G, a high-reward state): to make the correct decision we need all Q-values for the current position to be good.
Problem: in each run we back-propagate values only one step back; it takes multiple trials to back-propagate values over multiple steps.

Q-learning speed-ups
Remedy: back up values over a larger number of steps.
Rewards from applying the policy:
$$q_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$
We can substitute immediate rewards with n-step rewards:
$$q_t^{(n)} = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \max_{a'} \hat{Q}_{t+n}(s_{t+n}, a')$$
Postpone the update for n steps and update with the longer trajectory rewards:
$$\hat{Q}_{t+n+1}(s_t, a_t) \leftarrow \hat{Q}_{t+n}(s_t, a_t) + \alpha\left[ q_t^{(n)} - \hat{Q}_{t+n}(s_t, a_t) \right]$$
Problems: larger variance; exploration/exploitation switching; we must wait n steps to update.
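A small Python sketch of one n-step backup for a single (state, action) pair; the rewards, bootstrap value, and hyperparameters are made up, and in practice the rewards come from the n steps actually taken after (s_t, a_t).

```python
# One n-step backup for a single (state, action) pair; all numbers invented.
gamma, alpha, n = 0.9, 0.1, 3
rewards = [-1, -1, 10]          # r_t, r_{t+1}, ..., r_{t+n-1} (example values)
bootstrap = 4.0                 # max_a' Q_{t+n}(s_{t+n}, a') from the current table
q_sa = 0.5                      # current estimate Q_{t+n}(s_t, a_t)

# n-step return: discounted rewards plus the discounted bootstrap value.
q_t_n = sum((gamma ** i) * r for i, r in enumerate(rewards)) + (gamma ** n) * bootstrap

# Postponed update toward the longer-trajectory target.
q_sa = q_sa + alpha * (q_t_n - q_sa)
print(round(q_t_n, 3), round(q_sa, 3))
```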

Q-learning speed-ups
One-step vs. n-step backup (illustrated on the maze with goal G).
Problems with n-step backups: larger variance; exploration/exploitation switching; we must wait n steps to update.

Q-learning speed-ups
Temporal difference (TD) methods:
- a remedy for the wait-n-steps problem
- a partial back-up is performed after every simulation step
- a similar idea: weather forecast adjustment
Different versions of this idea have been implemented.

RL successes
- Reinforcement learning is relatively simple
- On-line techniques can track non-stationary environments and adapt to their changes
Successful applications:
- DeepMind's AlphaGo (AlphaZero)
- TD-Gammon: learned to play backgammon at championship level
- Elevator control
- Dynamic channel allocation in mobile telephony
- Robot navigation in the environment