Reinforcement learning

Regular MDP. Given: transition model P(s' | s, a) and reward function R(s). Find: policy π(s).

Reinforcement learning: the transition model and reward function are initially unknown, but we still need to find the right policy. Learn by doing.

Reinforcement learning: basic scheme. In each time step:
- Take some action.
- Observe the outcome of the action: successor state and reward.
- Update some internal representation of the environment and policy.
If you reach a terminal state, just start over (each pass through the environment is called a trial). Why is this called reinforcement learning?

Applications of reinforcement learning: Backgammon (TD-Gammon).
http://www.research.ibm.com/massive/tdl.html
http://en.wikipedia.org/wiki/TD-Gammon

Applications of reinforcement learning: learning a fast gait for Aibo (initial gait vs. learned gait).
Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion. Nate Kohl and Peter Stone. IEEE International Conference on Robotics and Automation, 2004.

Applications of reinforcement learning: Stanford autonomous helicopter.

Reinforcement learning strategies.
- Model-based: learn the model of the MDP (transition probabilities and rewards) and try to solve the MDP concurrently.
- Model-free: learn how to act without explicitly learning the transition probabilities P(s' | s, a). Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s.

Model-based reinforcement learning. Basic idea: try to learn the model of the MDP (transition probabilities and rewards) and learn how to act (solve the MDP) simultaneously.
Learning the model: keep track of how many times state s' follows state s when you take action a, and update the transition probability P(s' | s, a) according to the relative frequencies. Keep track of the rewards R(s).
Learning how to act: estimate the utilities U(s) using the Bellman equations, then choose the action that maximizes expected future utility:
π*(s) = arg max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
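A minimal sketch of the model-learning step described above (tabular counting), in Python; the class and method names here are illustrative assumptions, not part of the original slides.

from collections import defaultdict

class ModelEstimator:
    """Tabular model learning by counting observed transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.rewards = {}                                    # s -> observed reward R(s)

    def record(self, s, a, s_next, r_next):
        # Keep track of how many times s' follows s when action a is taken,
        # and of the reward observed in each visited state.
        self.counts[(s, a)][s_next] += 1
        self.rewards[s_next] = r_next

    def transition_prob(self, s, a, s_next):
        # Relative-frequency estimate of P(s' | s, a).
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0

Given these estimates, U(s) can be computed by solving the estimated MDP (e.g. with value iteration) and π*(s) chosen greedily as in the formula above.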

Model-based reinforcement learning. Learning how to act: estimate the utilities U(s) using the Bellman equations, and choose the action that maximizes expected future utility given the model of the environment we have experienced through our actions so far:
π*(s) = arg max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
Is there any problem with this greedy approach?

Exploration vs. exploitation.
Exploration: take a new action with unknown consequences.
- Pros: get a more accurate model of the environment; discover higher-reward states than the ones found so far.
- Cons: while you are exploring, you are not maximizing your utility; something bad might happen.
Exploitation: go with the best strategy found so far.
- Pros: maximize reward as reflected in the current utility estimates; avoid bad stuff.
- Cons: might also prevent you from discovering the true optimal strategy.

Incorporating exploration. Idea: explore more in the beginning, then become more and more greedy over time.
Standard (greedy) selection of the optimal action:
a = arg max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U(s')
Modified strategy:
a = arg max_{a ∈ A(s)} f( Σ_{s'} P(s' | s, a) U(s'), N(s, a) )
where f is an exploration function, N(s, a) is the number of times we have taken action a in state s, and
f(u, n) = R+ if n < N_e, and u otherwise,
with R+ an optimistic reward estimate.
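A minimal sketch of the exploration function f(u, n) above, with assumed example values for the optimistic reward R+ and the threshold N_e.

R_PLUS = 10.0  # optimistic reward estimate R+ (assumed example value)
N_E = 5        # minimum number of tries per (state, action) pair (assumed example value)

def exploration_f(u, n, r_plus=R_PLUS, n_e=N_E):
    """f(u, n): stay optimistic until (s, a) has been tried n_e times, then use the utility estimate."""
    return r_plus if n < n_e else u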

Model-free reinforcement learning. Idea: learn how to act without explicitly learning the transition probabilities P(s' | s, a). Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s. Relationship between Q-values and utilities: U(s) = max_a Q(s, a).

Model-free reinforcement learning. Q-learning: learn an action-utility function Q(s, a) that tells us the value of doing action a in state s, with U(s) = max_a Q(s, a).
Equilibrium constraint on Q-values:
Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
Problem: we don't know (and don't want to learn) P(s' | s, a).

Temporal difference (TD) learning. Equilibrium constraint on Q-values:
Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a')
Temporal difference (TD) update: pretend that the currently observed transition (s, a, s') is the only possible outcome and adjust the Q-values toward the local equilibrium:
Q_local(s, a) = R(s) + γ max_{a'} Q(s', a')
Q_new(s, a) = (1 − α) Q(s, a) + α Q_local(s, a)
that is,
Q_new(s, a) = Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) )
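The TD update above, written as a small Python function; Q is assumed to be a dict keyed by (state, action) pairs and actions(s) a helper returning the legal actions in state s (both illustrative assumptions).

def td_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a))."""
    q_sa = Q.get((s, a), 0.0)
    # Value of the best action from the observed successor state (0 if terminal).
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions(s_next)), default=0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
    return Q[(s, a)]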

Temporal difference (TD) learning. At each time step t:
- From the current state s, select an action a = arg max_{a} f( Q(s, a), N(s, a) ), where f is the exploration function and N(s, a) is the number of times we have taken action a from state s.
- Get the successor state s'.
- Perform the TD update: Q(s, a) ← Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) ), where α is the learning rate; it should start at 1 and decay as O(1/t), e.g. α(t) = 60/(59 + t).
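Putting the pieces together, a sketch of one trial of the TD Q-learning loop above; env.reset(), env.step(), env.actions(), and env.is_terminal() are hypothetical environment methods, and exploration_f / td_update are the sketches given earlier.

def run_trial(env, Q, N, gamma=0.9, r_plus=10.0, n_e=5):
    """One pass (trial) through the environment with TD Q-learning."""
    s, r = env.reset()  # hypothetical: start state and its reward R(s)
    t = 0
    while not env.is_terminal(s):
        t += 1
        alpha = 60.0 / (59.0 + t)  # decaying learning rate, O(1/t), as on the slide
        # Select an action using the exploration function f(Q(s,a), N(s,a)).
        a = max(env.actions(s),
                key=lambda act: exploration_f(Q.get((s, act), 0.0),
                                              N.get((s, act), 0), r_plus, n_e))
        N[(s, a)] = N.get((s, a), 0) + 1
        s_next, r_next = env.step(s, a)  # hypothetical: observed successor state and its reward
        td_update(Q, s, a, r, s_next, env.actions, alpha, gamma)
        s, r = s_next, r_next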

Function approximation. So far we have assumed a lookup-table representation for the utility function U(s) or the action-utility function Q(s, a). But what if the state space is really large or continuous?
Alternative idea: approximate the utility function as a weighted linear combination of features:
U(s) = w1 f1(s) + w2 f2(s) + ... + wn fn(s)
RL algorithms can be modified to estimate these weights. Recall: features for designing evaluation functions in games.
Benefits: can handle very large state spaces (games) and continuous state spaces (robot control); can generalize to previously unseen states.
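A minimal sketch of the linear function approximation idea; features(s) is a hypothetical hand-designed feature function, and the weight update shown is one standard (gradient-style) way a TD rule can be modified to estimate the weights, not a rule spelled out on the slide.

def approx_utility(weights, features, s):
    """U(s) ~= w1*f1(s) + w2*f2(s) + ... + wn*fn(s)."""
    return sum(w * f for w, f in zip(weights, features(s)))

def td_weight_update(weights, features, s, r, s_next, alpha, gamma):
    """Move the weights toward the TD target for the observed transition s -> s'."""
    error = (r + gamma * approx_utility(weights, features, s_next)
             - approx_utility(weights, features, s))
    return [w + alpha * error * f for w, f in zip(weights, features(s))]

Because the value of a state is shared through its features, updates made in visited states also change the estimates for previously unseen states, which is where the generalization benefit comes from.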