(Deep) Reinforcement Learning


Martin Matyášek, Artificial Intelligence Center, Czech Technical University in Prague
October 27, 2016

Reinforcement Learning in a picture (R. S. Sutton and A. G. Barto, 2015)
- learning what to do to maximize future reward
- a general-purpose framework extending sequential decision making to the case when the model of the environment is unknown

RL background
- assume an MDP $\langle S, A, P, R, s_0 \rangle$
- RL deals with the situation where the environment model ($P$ and $R$) is unknown
- can be generalized to Stochastic Games $\langle S, N, A, P, R \rangle$
- an RL agent includes:
  - policy: $a = \pi(s)$ (deterministic), $\pi(a \mid s) = P(a \mid s)$ (stochastic)
  - value function: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s, a\right]$
  - Bellman equation: $Q^\pi(s, a) = \mathbb{E}_{s', a'}\left[r + \gamma Q^\pi(s', a') \mid s, a\right]$
  - optimal value function: $Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
  - optimal policy: $\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)$
  - model: a learned proxy for the environment
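To make the Bellman optimality equation and $\pi^*$ above concrete, here is a minimal value-iteration sketch in Python on a made-up two-state, two-action MDP (the transition tensor P, the reward matrix R, and $\gamma = 0.9$ are invented for illustration, not from the slides):

```python
import numpy as np

# hypothetical toy MDP: 2 states, 2 actions (all numbers invented for illustration)
gamma = 0.9
P = np.array([            # P[s, a, s'] = transition probability P(s' | s, a)
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([            # R[s, a] = expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])

Q = np.zeros((2, 2))
for _ in range(200):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = R + gamma * P @ Q.max(axis=1)

pi_star = Q.argmax(axis=1)   # optimal policy pi*(s) = argmax_a Q*(s, a)
print(Q, pi_star)
```

With the model $P, R$ known this is planning rather than RL; Q-learning below recovers the same $Q^*$ from sampled transitions only.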

RL Types
1. value-based RL: estimate the optimal value function $Q^*(s, a)$, i.e. the maximum value achievable under any policy
2. policy-based RL: search directly for the optimal policy $\pi^*$, i.e. the policy yielding maximum future reward
3. model-based RL: build a model of the environment and plan using this model

Q-learning

Algorithm 1 Q-learning
1: initialize the Q-function and V values (arbitrarily)
2: repeat
3:   observe the current state $s_t$
4:   select action $a_t$ and take it
5:   observe the reward $R(s_t, a_t, s_{t+1})$
6:   $Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left( R(s_t, a_t, s_{t+1}) + \gamma V_t(s_{t+1}) \right)$
7:   $V_{t+1}(s) \leftarrow \max_a Q_{t+1}(s, a)$
8: until convergence
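A minimal tabular sketch of Algorithm 1 in Python, assuming a classic Gym-style environment with discrete observation and action spaces (the ε-greedy exploration rule and the hyperparameter values are illustrative choices, not part of the slide):

```python
import numpy as np

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning; env is assumed to follow the classic Gym API
    (env.observation_space.n, env.action_space.n, reset(), step())."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # select an action (epsilon-greedy) and take it
            a = env.action_space.sample() if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            # line 6 of Algorithm 1, with V_t(s_{t+1}) = max_a Q_t(s_{t+1}, a);
            # terminal states bootstrap with 0, as in Algorithm 2 later
            v_next = 0.0 if done else Q[s_next].max()
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * v_next)
            s = s_next
    return Q
```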

Q-learning
- model-free method
- temporal-difference version: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{value based on next state}} - Q(s, a) \big)$
- converges to $Q^*$, $V^*$ iff $0 \le \alpha_t < 1$, $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
- zero-sum Stochastic Games: cannot simply use $Q_i^\pi : S \times A_i \to \mathbb{R}$, but rather $Q_i^\pi : S \times A \to \mathbb{R}$
- minimax-Q converges to a Nash equilibrium (NE)
- R-max: converges to an $\epsilon$-Nash equilibrium with probability $(1 - \delta)$ in a polynomial number of steps (PAC learning)
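A standard schedule satisfying these conditions (a textbook example, not stated on the slide) is $\alpha_t(s, a) = \frac{1}{1 + n_t(s, a)}$, where $n_t(s, a)$ counts how many times the pair $(s, a)$ has been updated: $\sum_t \frac{1}{t}$ diverges while $\sum_t \frac{1}{t^2} < \infty$, so both conditions hold; a constant $\alpha$ violates the second condition and only tracks $Q^*$ approximately.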

Q-Networks
- approximate $Q^*(s, a)$ by $Q(s, a, w)$, a neural network with weights $w$
- treat the right-hand side $r + \gamma \max_{a'} Q(s', a', w)$ of Bellman's equation as the target
- minimize the MSE loss $l = \left( r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w) \right)^2$ by stochastic gradient descent

(slide credit: David Silver, Google DeepMind)
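A minimal PyTorch sketch of this loss (network sizes, the SGD learning rate, and the batch variable names are assumptions for illustration); the right-hand side is treated as a fixed target, i.e. no gradient flows through it:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, ., w), assumed sizes
opt = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def q_learning_step(s, a, r, s_next, gamma=0.99):
    """One SGD step on l = (r + gamma * max_a' Q(s', a', w) - Q(s, a, w))^2."""
    with torch.no_grad():                                   # RHS of Bellman's eq. is the target
        target = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a, w) for the taken actions
    loss = nn.functional.mse_loss(q_sa, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

As the next slide notes, using the same network for both the prediction and the target is exactly what makes this naive version unstable.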

Q-learning summary
+ converges to $Q^*$ using a table-lookup representation
− diverges using a neural network: correlations between samples, non-stationary targets
⇒ go deep!

Deep Q-Networks (DQN)
- the basic approach is called experience replay
- idea: remove correlations by building a data set from the agent's experience $e = (s, a, r, s')$
- sample experiences from $D_t = \{e_1, e_2, \ldots, e_t\}$ and apply the update
- deal with non-stationarity by fixing the target weights $w^-$ in $l = \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2$
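A minimal sketch of the two ideas on this slide (buffer capacity and batch size are arbitrary illustrative values): experiences are stored and sampled uniformly at random, and a separate frozen copy of the weights $w^-$ is used for the target and refreshed only occasionally:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experience tuples e = (s, a, r, s', done) and samples them uniformly,
    which breaks the correlations between consecutive transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        return map(list, zip(*batch))        # lists of s, a, r, s', done

    def __len__(self):
        return len(self.buffer)

# non-stationarity fix: a frozen copy w^- of the weights is used inside the target,
# e.g. target_net.load_state_dict(q_net.state_dict()) every C steps (see the loop below)
```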

Algorithm 2 Deep Q-learning
1: initialize replay memory $D$, initialize $Q$ with random weights
2: observe the initial state $s$
3: repeat
4:   with probability $\epsilon$ select a random action $a$, otherwise select $a = \operatorname{argmax}_{a'} Q(s, a')$
5:   carry out $a$, observe $(r, s')$ and store $(s, a, r, s')$ in $D$
6:   sample random transitions $(ss, aa, rr, ss')$ from $D$
7:   calculate the target for each minibatch transition:
8:   if $ss'$ is a terminal state then
9:     $tt \leftarrow rr$
10:  else
11:    $tt \leftarrow rr + \gamma \max_{a'} Q(ss', a')$
12:  end if
13:  train the Q-network using $(tt - Q(ss, aa))^2$ as the loss
14:  $s \leftarrow s'$
15: until convergence
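A condensed PyTorch sketch of Algorithm 2, reusing the ReplayBuffer above and a target network for $w^-$ (the environment API, exploration schedule, optimizer, and sync frequency are assumptions; the slide's pseudocode uses a single network, the frozen copy follows the previous slide's fix for non-stationarity):

```python
import copy
import torch
import torch.nn as nn

def train_dqn(env, q_net, steps=50_000, gamma=0.99, eps=0.1, batch_size=32, sync_every=1_000):
    target_net = copy.deepcopy(q_net)                          # frozen weights w^-
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)
    buffer = ReplayBuffer()
    s = env.reset()                                            # classic Gym API assumed
    for step in range(steps):
        # line 4: epsilon-greedy action selection
        if torch.rand(1).item() < eps:
            a = env.action_space.sample()
        else:
            a = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0)).argmax(1).item()
        s_next, r, done, _ = env.step(a)
        buffer.push(s, a, r, s_next, done)                     # line 5
        s = env.reset() if done else s_next
        if len(buffer) >= batch_size:
            ss, aa, rr, ss_next, dd = buffer.sample(batch_size)            # line 6
            ss = torch.as_tensor(ss, dtype=torch.float32)
            ss_next = torch.as_tensor(ss_next, dtype=torch.float32)
            aa = torch.as_tensor(aa, dtype=torch.int64)
            rr = torch.as_tensor(rr, dtype=torch.float32)
            dd = torch.as_tensor(dd, dtype=torch.float32)
            with torch.no_grad():                              # lines 8-12: tt = rr (terminal) or rr + gamma max Q
                tt = rr + gamma * (1.0 - dd) * target_net(ss_next).max(1).values
            q_sa = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, tt)            # line 13
            opt.zero_grad(); loss.backward(); opt.step()
        if step % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())     # refresh w^-
    return q_net
```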

DQN in Atari
(slide credit: David Silver, Google DeepMind)

DQN in Atari - setting
- state: stack of raw pixels from the last 4 frames
- actions: 18 joystick/button positions
- reward: delta in the game score
- learn $Q(s, a)$

(slide credit: David Silver, Google DeepMind)

DQN in Atari - results

DQN improvements
- Double DQN: removes the bias caused by $\max_a Q(\cdot)$; the current Q-network (weights $w$) is used to select actions, the old Q-network (weights $w^-$) is used to evaluate actions (see the sketch below)
- Prioritized Replay: weight experience according to the DQN error (stored in a priority queue)
- Dueling Network: split the Q-network into an action-independent value function and an action-dependent advantage function
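A minimal sketch of the Double DQN target only (tensor names follow the training loop above and are assumptions): the current network with weights $w$ selects the greedy action, while the old network with weights $w^-$ evaluates it, which removes the over-estimation bias of $\max_a Q(\cdot)$:

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, rr, ss_next, dd, gamma=0.99):
    # current Q-network (w) selects the action ...
    a_star = q_net(ss_next).argmax(dim=1, keepdim=True)
    # ... the old Q-network (w^-) evaluates it
    q_eval = target_net(ss_next).gather(1, a_star).squeeze(1)
    return rr + gamma * (1.0 - dd) * q_eval
```

Replacing the target computation in the loop above with this function is the only change Double DQN requires.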

General Reinforcement Learning Architecture (GORILA)

Deep Policy Network
- parametrize the policy $\pi$ by a DNN and use SGD to optimize the weights $u$: $\pi(a \mid s, u)$ or $\pi(s, u)$
- maximize $L(u) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid \pi(\cdot, u)\right]$
- policy gradients: follow the gradient $\frac{\partial L(u)}{\partial u}$
  - $\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial \log \pi(a \mid s, u)}{\partial u}\, Q^\pi(s, a)\right]$ for a stochastic policy $\pi(a \mid s, u)$
  - $\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial Q^\pi(s, a)}{\partial a} \frac{\partial a}{\partial u}\right]$ for a deterministic policy $a = \pi(s)$, where $a$ is continuous and $Q$ is differentiable
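A minimal REINFORCE-style sketch of the stochastic-policy gradient above (network sizes are assumptions, and the sampled return $G_t$ is used as a Monte-Carlo estimate of $Q^\pi(s_t, a_t)$): minimizing $-\sum_t \log \pi(a_t \mid s_t, u)\, G_t$ by SGD follows $\frac{\partial L(u)}{\partial u}$ in expectation:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits of pi(a|s,u); assumed sizes
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards, gamma=0.99):
    """One policy-gradient step from a single episode; G_t stands in for Q^pi(s_t, a_t)."""
    returns, G = [], 0.0
    for r in reversed(rewards):                  # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns = torch.as_tensor(list(reversed(returns)), dtype=torch.float32)

    logits = policy(torch.as_tensor(states, dtype=torch.float32))
    log_pi = torch.distributions.Categorical(logits=logits).log_prob(
        torch.as_tensor(actions, dtype=torch.int64))
    loss = -(log_pi * returns).sum()             # gradient = -E[ dlog pi(a|s,u)/du * Q^pi(s,a) ]
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

The deterministic-policy form of the gradient is the basis of DPG/DDPG-style actor-critic methods, which connect to the Actor-Critic topic announced on the next slide.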

Next time
- continued policy-based deep RL: Actor-Critic algorithms, A3C
- Fictitious Self-Play
- model-based deep RL