Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent


CSE 473, Chapter 21: Reinforcement Learning

The Reinforcement Learning Agent
[Diagram: the Agent sends an Action a to the Environment; the Environment returns a State u and a Reward r to the Agent.]
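The interaction in the diagram is just a loop. A minimal sketch in Python, assuming hypothetical `env` and `agent` objects whose methods are illustrative only (the slides do not define any API):

```python
# Sketch of the agent-environment loop from the diagram above.
# `env` and `agent` are hypothetical stand-ins; their methods (reset, step,
# select_action, update) are assumptions for illustration, not the slide's API.

def run_episode(env, agent):
    u = env.reset()                      # initial state u
    done = False
    while not done:
        a = agent.select_action(u)       # agent chooses action a
        u_next, r, done = env.step(a)    # environment returns new state u' and reward r
        agent.update(u, a, r, u_next)    # agent learns from (u, a, r, u')
        u = u_next
```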

Why reinforcement learning?
- Programming an agent to drive a car or fly a helicopter is very hard!
- Can an agent learn to drive or fly through positive/negative rewards?

Why reinforcement learning?
- Can an agent learn to win at board games through rewards?
- Win = large positive reward, Lose = negative reward
- Learn an evaluation function for different board positions
- Play games against itself

Why reinforcement learning?
- Humans and animals learn through rewards
- Reinforcement learning as a model of brain function
- Pavlov's dog. Training: Bell → Food. After: Bell → Salivate

Toy Example: Agent in a Maze
[Maze figure: a grid of locations containing a +10 reward cell and a -10 punishment cell.]
- States = maze locations
- Actions = move forward, left, right, back
- Rewards = +10 at the reward location, -10 at the punishment location, -1 at all other locations (the cost of moving)
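As a concrete (hypothetical) encoding of this toy MDP's rewards, one could use a simple lookup; the grid coordinates below are placeholders, since the exact cells in the slide's figure do not survive in the transcription:

```python
# Hypothetical encoding of the maze rewards: +10 at the goal cell, -10 at the
# penalty cell, -1 everywhere else (the cost of moving). The coordinates are
# placeholders, not the positions from the slide's figure.
GOAL = (3, 4)
PIT = (2, 4)

def reward(state):
    if state == GOAL:
        return 10
    if state == PIT:
        return -10
    return -1
```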

Actions might be noisy
- An action may not always succeed
- E.g. 0.9 probability of moving forward, 0.1 probability divided equally among the other neighboring locations
- Characterized by transition probabilities: P(next state | current state, action)

Goal: Learn a Policy
[Maze figure with the +10 and -10 cells.]
- Policy = for each state, what is the best action that maximizes my expected reward?
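A sketch of the noisy transition model in code; only the 0.9/0.1 split comes from the slide, while the `neighbors` helper (mapping a state to its adjacent locations, keyed by action) is a hypothetical assumption:

```python
import random

# Sketch of the noisy transition model described above: the intended move
# succeeds with probability 0.9, and the remaining 0.1 is split equally among
# the other neighboring locations. `neighbors` is a hypothetical helper.

def sample_next_state(state, action, neighbors):
    options = neighbors(state)                 # e.g. {'forward': (1, 2), 'left': (2, 1), ...}
    intended = options[action]
    others = [loc for a, loc in options.items() if a != action]
    if not others or random.random() < 0.9:
        return intended                        # the action succeeds
    return random.choice(others)               # each other neighbor gets 0.1 / len(others)
```

Sampling from this function repeatedly is equivalent to drawing from P(next state | current state, action).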

Goal: Learn a Policy
[Maze figure showing the optimal policy: the best action to take at each location.]

- A central problem in all these cases is learning to predict future reward.
- How do we do it? Can we use supervised learning?

Predicting Delayed Rewards
- Time: 0 ≤ t ≤ T, with input u(t) and reward r(t) (possibly 0) at each time step
- Key Idea: Make the output v(t) of the supervised learner predict the total expected future reward starting from time t:
    v(t) ≈ ⟨ Σ_{τ=0}^{T−t} r(t+τ) ⟩        (⟨·⟩ denotes average)

Learning to Predict Delayed Rewards
- Use a set of modifiable weights w(τ) and predict based on all past inputs u:
    v(t) = Σ_{τ=0}^{t} w(τ) u(t−τ)        (a linear neural network)
- Would like to find w that minimize:
    ⟨ ( Σ_{τ=0}^{T−t} r(t+τ) − v(t) )² ⟩
- Can we minimize this using gradient descent and the delta rule?
- Yes, BUT... the future rewards are not yet available!
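Assuming the inputs u(0..T) and weights w(0..T) are stored as NumPy arrays, the linear prediction is a dot product with the time-reversed input history, a minimal sketch:

```python
import numpy as np

# Sketch of the linear prediction v(t) = sum_{tau=0..t} w(tau) * u(t - tau).

def predict(w, u, t):
    """w, u: 1-D arrays of weights w(0..T) and inputs u(0..T)."""
    taus = np.arange(t + 1)                 # tau = 0, 1, ..., t
    return np.dot(w[taus], u[t - taus])     # sum over tau of w(tau) * u(t - tau)
```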

Temporal Difference (TD) Learning
- Key Idea: Rewrite the squared error to get rid of the future terms:
    ⟨ ( Σ_{τ=0}^{T−t} r(t+τ) − v(t) )² ⟩
      = ⟨ ( r(t) + Σ_{τ=0}^{T−t−1} r(t+1+τ) − v(t) )² ⟩
      ≈ ⟨ ( r(t) + v(t+1) − v(t) )² ⟩

Temporal Difference (TD) Learning
- TD Learning: For each time step t, do:
    Prediction: v(t) = Σ_{τ=0}^{t} w(τ) u(t−τ)
    For all τ ≥ 0, do:
      w(τ) → w(τ) + ε [ r(t) + v(t+1) − v(t) ] u(t−τ)
  Here r(t) + v(t+1) plays the role of the expected future reward and v(t) is the prediction.
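A minimal NumPy sketch of this update rule, assuming one trial of scalar inputs u(0..T) and rewards r(0..T) that is replayed repeatedly; taking v(T+1) = 0 at the end of the trial is my assumption, since the slide does not state the boundary condition, and terms with τ > t are omitted because u(t−τ) then falls outside the trial:

```python
import numpy as np

def td_learn_weights(u, r, epsilon=0.1, n_trials=200):
    """TD learning of weights w(tau) for v(t) = sum_{tau<=t} w(tau) u(t - tau).
    u, r: 1-D arrays of inputs and rewards over one trial of length T+1."""
    T = len(u) - 1
    w = np.zeros(T + 1)
    for _ in range(n_trials):
        for t in range(T + 1):
            v_t = np.dot(w[:t + 1], u[t::-1])                           # v(t)
            v_next = np.dot(w[:t + 2], u[t + 1::-1]) if t < T else 0.0  # v(t+1), taken as 0 past the trial end
            delta = r[t] + v_next - v_t                                 # TD error r(t) + v(t+1) - v(t)
            taus = np.arange(t + 1)
            w[taus] += epsilon * delta * u[t - taus]                    # w(tau) += eps * delta * u(t - tau)
    return w
```

For example, with u marking a single stimulus early in the trial and r a single late reward, the learned v(t) comes to predict the summed future reward from the time of the stimulus onward.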

Temporal Difference Learning in the Brain
- Activity of a dopaminergic cell in the Ventral Tegmental Area
- [Figure: cell activity before and after training.] The cell's response reflects the reward prediction error [ r(t) + v(t+1) − v(t) ]. After training the reward is predicted, the error is ≈ 0, and there is no response at the time of reward.

Selecting Actions when Reward is Delayed
[Maze figure: start state A branches to B and C, each with two exits carrying different rewards.]
- Can we learn the optimal policy for this maze?
- States: A, B, or C
- Possible actions at any state: Left (L) or Right (R)
- If you randomly choose to go L or R (a random policy), what is the value of each state?

Policy Evaluation
- (Location, action) → new location: (u, a) → u′
- Use the output v(u) = w·u
- For the random policy:
    v(B) = ½ (0 + 5) = 2.5
    v(C) = ½ (2 + 0) = 1
    v(A) = ½ (v(B) + v(C)) = 1.75
- Can learn this using TD learning:
    w(u) → w(u) + ε [ r_a(u) + v(u′) − v(u) ]

Maze Value Learning for the Random Policy
[Figure: learned values converging to v(A) = 1.75, v(B) = 2.5, v(C) = 1 over trials.]
- Once I know the values, I can pick the action that leads to the higher-valued state!
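This value computation can be checked with tabular TD learning. The sketch below is my own encoding of the maze implied by those numbers (A branches to B and C with no immediate reward; B's two exits pay 0 and 5; C's pay 2 and 0), an assumption about the figure rather than something stated in the transcription:

```python
import random

# Tabular TD evaluation of the random policy on the A/B/C maze.
# The exit rewards (B: 0 or 5, C: 2 or 0) are inferred from the values
# quoted on the slide and are an assumption about the figure.
transitions = {
    'A': {'L': ('B', 0), 'R': ('C', 0)},       # (next state, immediate reward)
    'B': {'L': ('end', 0), 'R': ('end', 5)},
    'C': {'L': ('end', 2), 'R': ('end', 0)},
}

v = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'end': 0.0}
epsilon = 0.1
for _ in range(5000):
    u = 'A'
    while u != 'end':
        a = random.choice(['L', 'R'])                # random policy
        u_next, r = transitions[u][a]
        v[u] += epsilon * (r + v[u_next] - v[u])     # TD update
        u = u_next

print(v)   # fluctuates around v(A) = 1.75, v(B) = 2.5, v(C) = 1
```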

Selecting Actions based on Values
- v(B) = 2.5 and v(C) = 1
- Values act as surrogate immediate rewards
- The locally optimal choice leads to a globally optimal policy
- Related to Dynamic Programming

Q learning
Simple method for action selection based on action values (or Q values) Q(u,a), where u is a state and a is an action.
1. Let u be the current state. Select an action a according to:
     P(a) = exp(β Q(u,a)) / Σ_{a′} exp(β Q(u,a′))
2. Execute a and record the new state u′ and reward r. Update Q:
     Q(u,a) → Q(u,a) + ε [ r + max_{a′} Q(u′,a′) − Q(u,a) ]
3. Repeat until an end state is reached
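The three numbered steps translate almost directly into code. The sketch below reuses the `transitions` dictionary from the previous example (still an assumption about the maze figure), with softmax (β) action selection and the Q update from the slide:

```python
import math
import random

def softmax_action(Q, u, actions, beta):
    """Step 1: choose a with P(a) = exp(beta Q(u,a)) / sum_a' exp(beta Q(u,a'))."""
    weights = [math.exp(beta * Q[(u, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

def q_learning(transitions, actions=('L', 'R'), beta=1.0, epsilon=0.1, n_episodes=5000):
    # Q values for every (state, action) pair; the terminal state keeps Q = 0.
    Q = {(u, a): 0.0 for u in list(transitions) + ['end'] for a in actions}
    for _ in range(n_episodes):
        u = 'A'                                        # episodes start at A in this maze
        while u != 'end':                              # Step 3: repeat until an end state is reached
            a = softmax_action(Q, u, actions, beta)    # Step 1: softmax action selection
            u_next, r = transitions[u][a]              # Step 2: execute a, observe u' and r
            target = r + max(Q[(u_next, b)] for b in actions)
            Q[(u, a)] += epsilon * (target - Q[(u, a)])  # Q(u,a) += eps [ r + max_a' Q(u',a') - Q(u,a) ]
            u = u_next
    return Q
```

With enough episodes, the greedy action at A is the one leading toward B, consistent with v(B) > v(C) above.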

Reinforcement Learning Applications
- Example: Flying a helicopter via reinforcement learning (videos)
- Work of Andrew Ng, Stanford: http://ai.stanford.edu/~ang/