Learning Control for Air Hockey Striking using Deep Reinforcement Learning

1 Learning Control for Air Hockey Striking using Deep Reinforcement Learning
Ayal Taitler, Nahum Shimkin
Faculty of Electrical Engineering, Technion - Israel Institute of Technology
May 8, 2017

2 Overview
1 Introduction: Reinforcement Learning; Q-Learning; Deep Reinforcement Learning
2 The Air Hockey Striking Problem: Goals; The Optimal Control Problem; The RL Formulation; DQN for Air Hockey; Guided-DQN Components; Results
3 Summary

3 The Reinforcement Learning Setup
At each time step t:
1 The agent and environment are in state s_t
2 The agent executes action a_t in the environment
3 The environment transitions to state s_{t+1} according to a_t and s_t
4 The agent observes state s_{t+1} and receives reward r_t
5 Set t = t + 1 and s_t = s_{t+1}
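
As a minimal sketch of this loop, assuming the classic Gym-style interface (reset/step) and a stand-in environment rather than the air hockey simulator:

```python
import gym  # any environment with the classic Gym interface works here

env = gym.make("CartPole-v1")   # stand-in environment, not the air hockey table
s_t = env.reset()
done = False
while not done:
    a_t = env.action_space.sample()            # placeholder policy: act at random
    s_next, r_t, done, info = env.step(a_t)    # environment transition and reward
    s_t = s_next                               # t <- t + 1
```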

4 RL Components
The RL goal is to maximize the expected accumulated reward:

RL Objective:
$$\max_\pi \; \mathbb{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t r_t\Big], \qquad \pi : S \to A$$

The model is defined by the tuple $\langle S, A, P, r \rangle$, where
1 A - action space
2 S - state space, with initial state s_0
3 P - transition model, P(s' | s, a)
4 r - reward function, r_t = r(s_t, a_t)
5 γ - discount factor

5-6 Q-Learning, Watkins (1989)
Q-Learning is a learning version of the value iteration algorithm. At each state s_t, choose the action a_t that maximizes the function Q(s_t, a_t). Q is the estimated state-action value function: it tells us how good an action is in a given state. Formally:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)$$

The state-action value function can also be learned from off-policy samples <s_t, a_t, r_t, s_{t+1}>:

$$Q(s_t, a_t) \leftarrow \underbrace{Q(s_t, a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \Big[\underbrace{r_t + \gamma \max_a Q(s_{t+1}, a)}_{\text{target}} - \underbrace{Q(s_t, a_t)}_{\text{old value}}\Big]$$

Selected action: π(s) = argmax_a Q(s, a), combined with an exploration/exploitation (E/E) mechanism.
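
A minimal tabular sketch of this update rule, assuming small discrete state and action spaces (all sizes and constants here are illustrative):

```python
import numpy as np

n_states, n_actions = 50, 5
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One off-policy Q-learning step on a sample <s, a, r, s'>."""
    target = r + gamma * Q[s_next].max()    # bootstrapped target
    Q[s, a] += alpha * (target - Q[s, a])   # move Q toward the target by the TD error
```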

7-9 Deep Q-Learning
State-action transitions <s, a, r, s'> are sampled either from full episode roll-outs or from a distribution over an external memory, with

$$r = r(s, a), \qquad s' \sim P(\cdot \mid s, a)$$

The state-action value function can be approximated by a neural network (a parametric function approximator):

$$Q^\pi(s, a) \approx Q(s, a; \theta)$$

Define the objective function as the MSE of the temporal-difference error:

$$L(\theta) = \mathbb{E}_{s,a,r,s' \sim P}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta)}_{\text{target}} - \underbrace{Q(s, a; \theta)}_{\text{prediction}}\big)^2\Big]$$
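
A PyTorch-style sketch of this loss; the layer sizes and batch names are assumptions for illustration, not the network used in the talk:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))  # placeholder net
gamma = 0.99

def td_loss(s, a, r, s_next):
    """MSE of the TD error on a batch of transitions <s, a, r, s'>."""
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # prediction Q(s, a; theta)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values  # r + gamma max_a' Q(s', a')
    return ((target - q_pred) ** 2).mean()
```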

10-11 DQN & Double DQN Learning Rule, Van Hasselt et al. (2016)
In order to reduce oscillations and increase stability, we use:
- an experience replay buffer D
- an additional frozen target Q-network Q(s, a; θ̂)

$$L_{\text{DQN}}(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \hat\theta) - Q(s, a; \theta)\big)^2\Big]$$

and periodically update the target parameters θ̂ ← θ.

To reduce value overestimation, we decouple action selection from action evaluation, yielding

$$L_{\text{DDQN}}(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\Big[\big(r + \gamma Q(s', a^*; \hat\theta) - Q(s, a; \theta)\big)^2\Big], \qquad a^* = \operatorname*{argmax}_{a'} Q(s', a'; \theta)$$
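
As a sketch, the Double DQN target differs from the DQN target only in how the next action is chosen (placeholder networks with illustrative sizes):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))  # online net, theta
target_net = copy.deepcopy(q_net)  # frozen target net, theta_hat; synced every C steps

def ddqn_target(r, s_next, gamma=0.99):
    """Double DQN: select a* with the online net, evaluate it with the target net."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax_a' Q(s', a'; theta)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # Q(s', a*; theta_hat)
    return r + gamma * q_eval
```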

12 Deep Q-Network Learning Scheme
Initialize:
1 Experience replay buffer D
2 Online network with weights θ
3 Target network with weights θ̂ ← θ

Start at state s_0 and repeat for M steps:
1 In state s_t execute action a_t = argmax_a Q(s_t, a; θ), or explore
2 Observe reward r_t and new state s_{t+1}
3 Store transition <s_t, a_t, r_t, s_{t+1}> in the experience replay buffer D
4 Sample N transitions uniformly from D
5 Update the Q-network Q(s, a; θ) according to L(θ)
6 Set s_t ← s_{t+1}
7 Update the target Q-network with θ̂ ← θ every C steps
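
A condensed sketch of the whole scheme under stated assumptions: env_step is a hypothetical one-step simulator interface, and all hyperparameters and layer sizes are illustrative rather than the values used in the talk:

```python
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))
target_net.load_state_dict(q_net.state_dict())        # initialize theta_hat <- theta
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)
buffer = deque(maxlen=100_000)                        # experience replay buffer D
M, N, C, eps, gamma = 100_000, 32, 1000, 0.1, 0.99    # illustrative hyperparameters

s = torch.zeros(8)                                    # initial state s_0 (placeholder)
for step in range(M):
    # 1. in s_t, act greedily or explore (epsilon-greedy here for brevity)
    a = random.randrange(25) if random.random() < eps else int(q_net(s).argmax())
    # 2. observe r_t and s_{t+1}; env_step is a hypothetical simulator interface
    s_next, r = env_step(s, a)
    # 3. store the transition in D
    buffer.append((s, a, r, s_next))
    if len(buffer) >= N:
        # 4. sample N transitions uniformly from D
        s_b, a_b, r_b, sn_b = zip(*random.sample(buffer, N))
        s_b, sn_b = torch.stack(s_b), torch.stack(sn_b)
        a_b, r_b = torch.tensor(a_b), torch.tensor(r_b)
        # 5. one gradient step on L(theta)
        with torch.no_grad():
            tgt = r_b + gamma * target_net(sn_b).max(dim=1).values
        q = q_net(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
        loss = ((tgt - q) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    s = s_next                                        # 6. s_t <- s_{t+1}
    if (step + 1) % C == 0:
        target_net.load_state_dict(q_net.state_dict())  # 7. theta_hat <- theta
```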

13 The Air Hockey Physical Setup
(Figure: the physical air hockey table setup.)

14-16 Air Hockey Striking Problem Formulation
Goal: find a policy for the agent to strike the puck (a skill), such that the puck moves in a desired attack pattern.

Assumptions:
- The puck is stationary
- The agent has known physical limitations
- The agent's physical model and the puck-mallet collision model are unknown

Requirements:
- Maximum puck velocity after impact
- The puck is aimed at the center of the goal
- The puck's trajectory follows the selected skill

17-19 Agent's Dynamics
State space: the combination of the agent's (m) and puck's (p) positions and velocities,

$$s_t = [m_x, m_{V_x}, m_y, m_{V_y}, p_x, p_{V_x}, p_y, p_{V_y}]^T$$

Action space: acceleration commands (motor torques) $[a_x, a_y]^T$ with $|a_{x,y}| \le A_{\max}$, discretized in order to fit the Q-network scheme (sketched below). The boundary and zero actions are included (among others) so that the known bang-zero-bang profile is representable:

$$A = \{-A_{\max}, -A_{\max}/2, 0, A_{\max}/2, A_{\max}\}^2$$

Simulation dynamical model: 2-D second-order kinematics (double integrator) with constraints. Collision models are ideal, with restitution coefficient e = 0.99.
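
A small sketch of this discretization (names illustrative): five acceleration levels per axis yield the 25 discrete actions that the Q-network's outputs index.

```python
import itertools
import numpy as np

A_MAX = 1.0  # illustrative bound; the real limit comes from the motor constraints
levels = np.array([-A_MAX, -A_MAX / 2, 0.0, A_MAX / 2, A_MAX])

# Cartesian product over the two axes: 5 x 5 = 25 actions (a_x, a_y),
# including the boundary and zero actions needed for bang-zero-bang profiles
ACTIONS = list(itertools.product(levels, levels))

def action_from_index(idx):
    """Map a Q-network output index (0..24) to an acceleration command."""
    return ACTIONS[idx]
```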

20 Time Optimal Control Formulation
Control problem:

$$\begin{aligned}
\underset{a_k}{\text{minimize}} \quad & \phi(s_T) + \sum_{t=1}^{T-1} 1 \\
\text{subject to} \quad & s_{k+1} = f(s_k, a_k) \\
& s_k^{(i)} \in [S_{\min}^{(i)}, S_{\max}^{(i)}], \quad i = 1, \dots, 8 \\
& a_k^{(j)} \in [A_{\min}^{(j)}, A_{\max}^{(j)}], \quad j = 1, 2 \\
& s_0 = s(0)
\end{aligned}$$

where
- φ(s_T) is the constraint on the final condition
- f(s_k, a_k) is the physical model
- S_min/max and A_min/max are the constraints on the state space and actions respectively
- s_0, s_T are the initial and final conditions
- the collision model is embedded within the state and the final condition

21 The NN Controller
- Input: the physical state vector of the game, i.e. positions and velocities, s_t ∈ R^8
- Output: 25 Q-values (5 actions in each axis)
- Network: feedforward neural network with 4 layers
- Activations: ReLU
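
A hedged PyTorch sketch of such a controller; the slide fixes only the input size, output size, depth, and activation, so the hidden widths below are assumptions:

```python
import torch.nn as nn

# 8-dim state in, 25 Q-values out, 4 layers, ReLU activations
controller = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 25),
)
```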

22-23 Reward Definition
Reward function:

$$r_t = \begin{cases} r_{\text{time}} & \text{if } s_t \text{ is not terminal} \\ r_c + r_v + r_d & \text{if } s_t \text{ is terminal} \end{cases}$$

- r_c is a reward given for striking the puck
- r_v is a reward for maximal velocity in the direction of the goal: $r_v = \operatorname{sign}(V)\,V^2$
- r_d is a reward for the puck reaching the desired point at the opponent's goal (a sketch follows):

$$r_d = \begin{cases} c & |x - x_g| \le w \\ d\, e^{-(|x - x_g| - w)/c} & |x - x_g| > w \end{cases}$$
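
A sketch of this shaped reward; the decay outside the aiming window follows the reconstructed formula above, and all numeric values are illustrative placeholders:

```python
import numpy as np

R_TIME, R_C, C, D, W, X_G = -0.01, 10.0, 1.0, 5.0, 0.05, 0.0  # illustrative constants

def reward(terminal, struck, v_goal, x_final):
    """Per-step reward: time penalty while running, r_c + r_v + r_d at the end."""
    if not terminal:
        return R_TIME
    r_c = R_C if struck else 0.0                  # bonus for striking the puck
    r_v = np.sign(v_goal) * v_goal ** 2           # speed in the direction of the goal
    dist = abs(x_final - X_G)                     # distance from the goal center
    r_d = C if dist <= W else D * np.exp(-(dist - W) / C)
    return r_c + r_v + r_d
```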

24 Double DQN
Applying the DQN scheme to the air hockey problem (testing different target-update periods).
(Figure panels: (a) Left, (b) Middle, (c) Right, (d) Random, (e) Estimated value.)

25-26 Challenges
Observations:
- DQN takes a long time to start rising (no feature learning)
- The best policy learned is suboptimal
- Sharp drops in score values, producing an oscillating policy
- The average value is noisy and oscillating
- The learning (TD) error diverges
- DQN is very inconsistent

Possible explanations:
- Continuous state space
- Physical model dynamics
- Lack of rewards

27-28 Exploration/Exploitation in Continuous Domains
ε-greedy:

$$a_t = \begin{cases} a_t^* & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$

Local exploration:

$$a_t = a_t^* + N_t$$

where N_t is a temporally correlated random process.

- ε-greedy noise gets filtered out in physical systems with inertia; the system functions as a low-pass filter
- Local exploration needs to be around the optimum in order to work (see the sketch below)
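
A sketch of both schemes; the Ornstein-Uhlenbeck process below is one common choice of temporally correlated noise and is an assumption here, since the slide does not name the process:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(a_greedy, eps=0.1, n_actions=25):
    """Uncorrelated exploration: a random action with probability eps."""
    return rng.integers(n_actions) if rng.random() < eps else a_greedy

class OUNoise:
    """Temporally correlated noise N_t for local exploration: a_t = a_t* + N_t."""
    def __init__(self, dim=2, theta=0.15, sigma=0.2):
        self.theta, self.sigma = theta, sigma
        self.n = np.zeros(dim)

    def sample(self):
        self.n += -self.theta * self.n + self.sigma * rng.standard_normal(self.n.shape)
        return self.n
```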

29-30 Learning Guidance with On-line Demonstrations
Combine prior knowledge with the existing experience mechanism for one-stage learning:
1 With probability ε_p, instruct the agent to act according to a guiding policy π_g(s) for a full episode
2 Store the transitions in the replay buffer for learning

Acting scheme (see the sketch below):
- with probability ε_p, perform a guided episode: a_t = π_g(s_t)
- with probability 1 − ε_p, act according to the E/E scheme:

$$a_t = \begin{cases} \operatorname*{argmax}_a Q(s_t, a) + N_t & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$
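
Sketching the combined acting scheme: pi_g (the guiding policy, e.g. from the optimal control solution) and the noise process are assumed given, and the noisy greedy command is snapped back onto the discrete action grid, which is one plausible reading of argmax Q + N_t in a discrete action space:

```python
import numpy as np

rng = np.random.default_rng(1)
EPS_P, EPS = 0.2, 0.1  # guidance and exploration probabilities (illustrative)

def nearest_action(a_cont, actions):
    """Snap a continuous command back onto the discrete action grid."""
    return int(np.argmin([np.linalg.norm(a_cont - np.asarray(a)) for a in actions]))

def act(s, q_values, actions, noise, pi_g, guided):
    """One action under the guided exploration/exploitation scheme."""
    if guided:                                   # full guided episode, chosen w.p. EPS_P
        return pi_g(s)
    if rng.random() < EPS:                       # uncorrelated random exploration
        return rng.integers(len(actions))
    a_greedy = np.asarray(actions[int(np.argmax(q_values))])
    return nearest_action(a_greedy + noise.sample(), actions)  # argmax Q + N_t
```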

31-32 Fixed Target Update
Samples in the replay buffer (the dataset) are not stationary:
- fast target updates can follow changes rapidly, but may diverge
- slow target updates can filter out unstable changes, but may miss entire events

Therefore, after every target update set

$$C \leftarrow C \cdot C_r, \qquad C_r \ge 1$$

where C is the target-update period (a sketch follows).
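
A one-function sketch of this schedule inside the training loop; a growth factor slightly above 1 stretches the target-update period over time (values illustrative):

```python
C, C_r = 500, 1.02          # initial update period and growth factor (illustrative)
next_sync = C

def maybe_sync_target(step, sync_fn):
    """Sync theta_hat <- theta every C steps, then stretch C by C_r >= 1."""
    global C, next_sync
    if step >= next_sync:
        sync_fn()            # e.g. target_net.load_state_dict(q_net.state_dict())
        C = int(C * C_r)     # C <- C * C_r
        next_sync = step + C
```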

33 Guided-DQN
GDQN compared with Double DQN and Deep Deterministic Policy Gradient (DDPG).
(Figure panels: (a) Left, (b) Middle, (c) Right, (d) Random, (e) Estimated value.)

34 Control Profile and Trajectory
(Figure panels: (a) Profiles, (b) Trajectory.)

35 Learning Simulation
OpenAI striking simulation.

36 Summary
Conclusions:
- Formalizing an optimal control problem as a learning problem
- Combining different methods of exploration
- Injecting prior knowledge into the existing learning-from-experience framework
- Addressing the problem of non-stationarity in experience

Future work:
- Extending the setting to a moving puck
- Extending the learning scheme to continuous actions
- Implementing the learning algorithm in a physical environment

37 That's All Folks!
Thank You
