REINFORCEMENT LEARNING

Similar documents
(Deep) Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning

Approximate Q-Learning. Dan Weld / University of Washington

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Machine Learning: Jordan Boyd-Graber University of Maryland REINFORCEMENT LEARNING. Slides adapted from Tom Mitchell and Peter Abeel

Reinforcement Learning for NLP

Lecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010

Dueling Network Architectures for Deep Reinforcement Learning (ICML 2016)

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Reinforcement Learning

Reinforcement Learning and NLP

Introduction to Reinforcement Learning

David Silver, Google DeepMind

Reinforcement Learning. Introduction

CS230: Lecture 9 Deep Reinforcement Learning

, and rewards and transition matrices as shown below:

Q-Learning in Continuous State Action Spaces

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

CS599 Lecture 1 Introduction To RL

Grundlagen der Künstlichen Intelligenz

Q-learning. Tambet Matiisen

Variance Reduction for Policy Gradient Methods. March 13, 2017

Temporal Difference Learning & Policy Iteration

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

Internet Monetization

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Reinforcement Learning

Grundlagen der Künstlichen Intelligenz

Lecture 1: March 7, 2018

Deep Reinforcement Learning

Human-level control through deep reinforcement. Liia Butler

Reinforcement learning an introduction

Trust Region Policy Optimization

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

Reinforcement Learning. Yishay Mansour Tel-Aviv University

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

Reinforcement Learning

Policy Gradient Methods. February 13, 2017

Reinforcement Learning. Machine Learning, Fall 2010

1 Introduction 2. 4 Q-Learning The Q-value The Temporal Difference The whole Q-Learning process... 5

CSC321 Lecture 22: Q-Learning

Markov Decision Processes and Solving Finite Problems. February 8, 2017

Lecture 23: Reinforcement Learning

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017

Artificial Intelligence

Reinforcement Learning. George Konidaris

Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan

Decision Theory: Q-Learning

Lecture 3: Markov Decision Processes

Reinforcement Learning

Deep Reinforcement Learning: Policy Gradients and Q-Learning

The convergence limit of the temporal difference learning

Reinforcement Learning

PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.

6 Reinforcement Learning

Machine Learning I Reinforcement Learning

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

Sequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague

Reinforcement Learning: An Introduction

An Introduction to Reinforcement Learning

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Autonomous Helicopter Flight via Reinforcement Learning

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Deep Reinforcement Learning SISL. Jeremy Morton (jmorton2) November 7, Stanford Intelligent Systems Laboratory

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Grundlagen der Künstlichen Intelligenz

Reinforcement Learning: the basics

Reinforcement Learning

Multiagent (Deep) Reinforcement Learning

Reinforcement Learning

Reinforcement Learning via Policy Optimization

Reinforcement Learning

Notes on Reinforcement Learning

Reinforcement Learning

Approximation Methods in Reinforcement Learning

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Markov Decision Processes (and a small amount of reinforcement learning)

Reinforcement Learning and Control

Reinforcement Learning as Classification Leveraging Modern Classifiers

15-780: ReinforcementLearning

MDP Preliminaries. Nan Jiang. February 10, 2019

Reinforcement Learning (1)

Reinforcement Learning Part 2

MS&E338 Reinforcement Learning Lecture 1 - April 2, Introduction

Reinforcement Learning as Variational Inference: Two Recent Approaches

Q-Learning for Markov Decision Processes*

Lecture 3: The Reinforcement Learning Problem

16.410/413 Principles of Autonomy and Decision Making

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Reinforcement Learning II. George Konidaris

Reinforcement Learning

Planning in Markov Decision Processes

CS 188: Artificial Intelligence

Reinforcement Learning II. George Konidaris

Deep Reinforcement Learning via Policy Optimization

CS 570: Machine Learning Seminar. Fall 2016

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Transcription:

REINFORCEMENT LEARNING

Larry Page: Where s Google going next?

DeepMind's DQN playing Breakout

Contents Introduction to Reinforcement Learning Deep Q-Learning

INTRODUCTION TO REINFORCEMENT LEARNING

Contents Reinforcement Learning Markov Decision Process Value function Bellman equation Action value function (Q-function)

Supervised Learning Training samples: x i, y i Training goal: To find a function y i f(x i ) Atari game example: x i, y i = (, ) Game state Joystick control

Reinforcement Learning Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Atari Example

Reinforcement Learning Learning from interaction Goal-oriented learning Learning about, from, and while interacting with an external en vironment Learning what to do how to map situations to actions so as t o maximize a numerical reward signal

Key Features of RL Learner is not told which actions to take Trial-and-Error search Possibility of delayed reward (sacrifice short-term gains for gre ater long-term gains) The need to explore and exploit Considers the whole problem of a goal-directed agent interactin g with an uncertain environment

THEORY

Reinforcement Learning Setting S set of states A set of actions R S A R reward for given state and action.

Reinforcement Learning Setting The agent-environment interaction in reinforcement learning P{s t+1 = s, r t+1 = r s t, a t, s t 1, a t 1,, s 0, a 0 }

Reward Discount rate: γ [0,1) It is the discount factor, which represents the difference in importance between future rewards and present rewards. R t = r t+1 + γr t+2 + γ 2 r t+3 + = k=0 γ k r t+k+1

Markov Decision Process Markov Decision Process a reinforcement learning task that satisfies the Markov Property P{s t+1 = s, r t+1 = r s t, a t, s t 1, a t 1,, s 0, a 0 } = P{s t+1 = s, r t+1 = r s t, a t } P a ss R a ss = P{s t+1 = s s t = s, a t = a} = E{r t+1 s t = s, a t = a, s t+1 = s }

Policy Policy π A policy π is a mapping from each state, s S, and action a A(s), to the probability π s, a of tacking action a when in state s.

Value Functions State-value function for policy π. V π s = E R t s t = s} = E{ γ k r t+k+1 s t = s} Action-value function for policy π. k=0 Q π s, a = E R t s t = s, a t = a} = E{ γ k r t+k+1 s t = s, a t = a} k=0

Optimal Value Functions V s = max π Vπ (s) Q s, a = max π Qπ s, a = E{r t+1 + γv (s t+1 ) s t = s, a t = a}

Bellman Equation Bellman Equation for V π V π s = a π(a, s) s P a ss R a ss + γv π s Bellman Equation for V s V s = max a s P a ss R a ss Bellman equation for Q s, a + γv s Q a s, a = s P ss R a ss + γ max a Q s, a

Bellman Equation (Fixed policy) Bellman Equation for V π : a = π(s) V π s = s P ss a R ss a + γ s P a ss V π s = R(s, a) + γ s P a ss V π s Bellman Equation for V s : a = π(s) V s = max a s P a ss R a ss + γv s = R s, a + γmax a s P a ss V s Bellman equation for Q s, a Q a s, a = s P ss R a ss + γ max a Q s, a

LEARNING METHOD

A COMPLETE MODEL OF THE ENVIRONMENT S DYNAMICS IS GIVEN

Policy Evaluation Policy Evaluation: for a given policy p, compute the state-value f unction V π Recall: V π s = a π(a, s) s P a ss R a ss + γv π s A system of S simultaneous linear equations

Policy Evaluation

Policy Improvement Suppose we have computed V for a deterministic policy π. For a given state s, would it be better to do an action a π(s)? The value of doing a in state s is: Q π s, a = s P a ss R a ss = s P a ss R a ss + γv π (s ) + γv π a π s, a Q π s, a It is better to take an action a for state s if and only if Q π s, a > V π (s)

Policy Iteration

Other methods..

DEEP-Q LEARNING

Introduction We want to perform human-level control by using deep reinforcement learning. Reinforcement learning apply to Atari 2600 platforms presentative classic game.

Q-Networks Represent action value function by Q-network with weights w Q s, a; w Q (s, a)

Q-Learning Optimal Q-values should obey Bellman equation Bellman equation for Q s, a Q a s, a = s P ss Q a s, a; w = s P ss R a ss + γ max a R a ss + γ max a Q s, a Q s, a ; w = r + γ max a Q s, a ; w Treat right hand side r + γ max Q s, a ; w as a target a Minimize MSE loss by stochastic gradient descent

Deep Q-Networks

Deep Reinforcement Learning in Atari

DQN in Atari

Algorithm φ means stacking recent 4 images.

Architecture