Learning Control for Air Hockey Striking using Deep Reinforcement Learning
1 Learning Control for Air Hockey Striking using Deep Reinforcement Learning
Ayal Taitler, Nahum Shimkin
Faculty of Electrical Engineering, Technion - Israel Institute of Technology
May 8, 2017
2 Overview
1. Introduction
   - Reinforcement Learning
   - Q-Learning
   - Deep Reinforcement Learning
2. The Air Hockey Striking Problem
   - Goals
   - The Optimal Control Problem
   - The RL Formulation
   - DQN for Air Hockey
   - Guided-DQN components
   - Results
3. Summary
3 The Reinforcement Learning Setup
At each time step $t$:
1. The agent and the environment are in state $s_t$.
2. The agent executes action $a_t$ in the environment.
3. The environment transitions to state $s_{t+1}$ according to $a_t$ and $s_t$.
4. The agent observes state $s_{t+1}$ and receives reward $r_t$.
5. Set $t \leftarrow t + 1$ and $s_t \leftarrow s_{t+1}$.
4 RL Components
The RL goal is to maximize the expected cumulative discounted reward.
RL Objective:
$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right], \qquad \pi : S \to A$$
The model is defined by the tuple $\langle S, A, P, r \rangle$ together with a discount factor, where
1. $A$ - action space
2. $S$ - state space; initial state $s_0$
3. $P$ - transition model; $P(s' \mid s, a)$
4. $r$ - reward; $r_t = r(s_t, a_t)$
5. $\gamma$ - discount factor
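As a small illustration of the objective above, the discounted sum for a single finite episode can be computed as follows. This is a minimal sketch; the function name `discounted_return` and the example rewards are assumptions made for illustration and do not come from the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

The expectation in the objective would then be estimated by averaging this quantity over many episodes generated by the policy.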
6 Q-Learning, Watkins (1989)
Q-Learning is a learning version of the value iteration algorithm. In each state $s_t$, choose the action $a_t$ that maximizes $Q(s_t, a_t)$.
$Q$ is the estimated state-action value function: it tells us how good an action is in a given state. Formally:
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a)$$
The state-action value function can also be learned from off-policy samples $\langle s_t, a_t, r_t, s_{t+1} \rangle$:
$$Q(s_t, a_t) \leftarrow \underbrace{Q(s_t, a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \Big[ \underbrace{r_t + \gamma \max_{a} Q(s_{t+1}, a)}_{\text{target}} - \underbrace{Q(s_t, a_t)}_{\text{old value}} \Big]$$
Selected action:
$$\pi(s) = \arg\max_{a} Q(s, a) + \underbrace{\mathcal{E}}_{\text{exploration}}$$
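The sample-based update can be sketched for a tabular $Q$ as below. This is an illustrative sketch only: the dictionary-backed table, the function name `q_update`, and the state labels are assumptions, and terminal states (where the bootstrap term would be dropped) are not handled.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the TD error."""
    # Target: immediate reward plus discounted best value in the next state.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    td_error = target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)  # unseen (state, action) pairs start at 0
actions = [0, 1]
q_update(Q, s="s0", a=1, r=1.0, s_next="s1", actions=actions, alpha=0.5)
print(Q[("s0", 1)])  # 0.5
```

Because the target uses the maximum over next actions rather than the action actually taken, the transitions may come from any behavior policy, which is what makes the update off-policy.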
9 Deep Q-Learning
State-action transitions $\langle s, a, r, s' \rangle$ are sampled from either
- full-episode roll-outs, or
- a distribution over an external memory,
with $r = r(s, a)$ and $s' \sim P(\cdot \mid s, a)$.
The state-action value function can be approximated by a neural network (a parametric function approximator):
$$Q^{\pi}(s, a) \approx Q(s, a; \theta)$$
Define the objective function as the mean squared temporal-difference (TD) error:
$$L(\theta) = \mathbb{E}_{s,a,r,s' \sim P}\Big[ \big( \underbrace{r + \gamma \max_{a'} Q(s', a'; \theta)}_{\text{target}} - \underbrace{Q(s, a; \theta)}_{\text{prediction}} \big)^{2} \Big]$$
11 DQN & Double DQN Learning Rule, Van Hasselt et al. (2016)
To reduce oscillation and increase stability, we use
- an experience replay buffer $D$, and
- an additional frozen target Q-network $Q(s, a; \hat{\theta})$.
$$L_{\text{DQN}}(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\Big[ \big( r + \gamma \max_{a'} Q(s', a'; \hat{\theta}) - Q(s, a; \theta) \big)^{2} \Big]$$
The target parameters are updated periodically: $\hat{\theta} \leftarrow \theta$.
To reduce value overestimation, we decouple action selection from action evaluation, yielding
$$L_{\text{DDQN}}(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\Big[ \big( r + \gamma Q(s', a^{*}; \hat{\theta}) - Q(s, a; \theta) \big)^{2} \Big], \qquad a^{*} = \arg\max_{a'} Q(s', a'; \theta)$$
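The difference between the two targets is easiest to see on a single transition. The sketch below takes the Q-values of the next state as plain lists; the function names and the numbers are made up for illustration.

```python
def dqn_target(r, q_next_target, gamma=0.99):
    """DQN: the target network both selects and evaluates the next action."""
    return r + gamma * max(q_next_target)


def ddqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    a_star = max(range(len(q_next_online)), key=lambda i: q_next_online[i])
    return r + gamma * q_next_target[a_star]


q_online = [1.0, 3.0, 2.0]  # Q(s', . ; theta)
q_target = [2.5, 1.0, 4.0]  # Q(s', . ; theta_hat)
print(dqn_target(0.0, q_target, gamma=1.0))             # 4.0
print(ddqn_target(0.0, q_online, q_target, gamma=1.0))  # 1.0
```

In this example the target network's largest entry (4.0) belongs to an action the online network does not prefer, so the Double DQN target ignores it, which is the mechanism that limits overestimation.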
12 Deep Q-Network Learning Scheme
Initialize:
1. Experience replay buffer $D$
2. Online network with weights $\theta$
3. Target network with weights $\hat{\theta} \leftarrow \theta$
Start at state $s_0$ and repeat for $M$ steps:
1. In state $s_t$, execute action $a_t = \arg\max_{a} Q(s_t, a; \theta)$, or explore
2. Observe reward $r_t$ and new state $s_{t+1}$
3. Store the transition $\langle s_t, a_t, r_t, s_{t+1} \rangle$ in the experience replay buffer $D$
4. Sample $N$ transitions uniformly from $D$
5. Update the Q-network $Q(s, a; \theta)$ according to $L(\theta)$
6. Set $s_t \leftarrow s_{t+1}$
7. Update the target Q-network with $\hat{\theta} \leftarrow \theta$ every $C$ steps
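Steps 3 and 4 of the scheme rely on a replay buffer with uniform sampling. A minimal sketch of such a buffer is shown below; the class name `ReplayBuffer`, the fixed capacity, and the drop-oldest policy are assumptions, since the slides do not specify how the buffer is implemented.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once it is full.
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, n):
        # Uniform sampling without replacement, capped at the current size.
        return random.sample(list(self.data), min(n, len(self.data)))

    def __len__(self):
        return len(self.data)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(t, 0, 0.0, t + 1)
print(len(buffer))  # 3
```

Sampling from the buffer rather than using the latest transition breaks the temporal correlation between consecutive training samples.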
13 The Air Hockey Physical Setup
[Figure: the air hockey physical setup]
16 Air Hockey Striking Problem Formulation
Goal: find a policy (a skill) for the agent to strike the puck such that the puck moves in a desired attack pattern.
Assumptions:
- The puck is stationary.
- The agent has known physical limitations.
- The agent's physical model and the puck-mallet collision model are unknown.
Requirements:
- Maximum puck velocity after impact.
- The puck is aimed at the center of the goal.
- The puck's trajectory follows the selected skill.
19 Agent's Dynamics
State space: a combination of the agent's ($m$) and the puck's ($p$) positions and velocities:
$$s_t = [\, m_x,\ m_{V_x},\ m_y,\ m_{V_y},\ p_x,\ p_{V_x},\ p_y,\ p_{V_y} \,]^{T}$$
Action space:
- Acceleration commands (motor torques): $[a_x, a_y]^{T}$ with $|a_{x,y}| \le A_{\max}$
- Discretized in order to fit the Q-network scheme
- Boundary and zero actions are included (among others) so that the known Bang-Zero-Bang profile is available:
$$A = [\, -A_{\max},\ -A_{\max}/2,\ 0,\ A_{\max}/2,\ A_{\max} \,]^{2}$$
Simulation dynamical model:
- 2-D second-order kinematics (integrator) with constraints
- Collision models are ideal, with restitution coefficient $e = 0.99$
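The discretized action set and one step of the second-order kinematics might look like the sketch below. The numeric values (`A_MAX`, `dt`, `v_max`), the explicit-Euler integration, and the velocity clipping are all assumptions standing in for the unspecified constraints; the collision model is not included.

```python
from itertools import product

A_MAX = 1.0  # hypothetical acceleration limit
LEVELS = [-A_MAX, -A_MAX / 2, 0.0, A_MAX / 2, A_MAX]
ACTIONS = list(product(LEVELS, repeat=2))  # all 25 (a_x, a_y) pairs


def step_mallet(x, vx, y, vy, action, dt=0.05, v_max=2.0):
    """One explicit-Euler step of 2-D second-order kinematics."""
    ax, ay = action
    # Positions advance with the current velocities.
    x, y = x + vx * dt, y + vy * dt
    # Velocities integrate the commanded acceleration, clipped to a
    # hypothetical speed limit as a stand-in for the physical constraints.
    vx = max(-v_max, min(v_max, vx + ax * dt))
    vy = max(-v_max, min(v_max, vy + ay * dt))
    return x, vx, y, vy


print(len(ACTIONS))  # 25
print(step_mallet(0.0, 0.0, 0.0, 0.0, ACTIONS[-1], dt=0.5))
```

The index of an action in `ACTIONS` can serve directly as the index of the corresponding Q-network output.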
20 Time Optimal Control Formulation
Control problem:
$$\begin{aligned}
\underset{a_k}{\text{minimize}} \quad & \phi(s_T) + \sum_{t=1}^{T-1} [\,\ldots\,] \\
\text{subject to} \quad & s_{k+1} = f(s_k, a_k) \\
& s_k^{(i)} \in [S_{\min}^{(i)}, S_{\max}^{(i)}], \quad i = 1, \dots, 8 \\
& a_k^{(j)} \in [A_{\min}^{(j)}, A_{\max}^{(j)}], \quad j = 1, 2 \\
& s_0 = s(0)
\end{aligned}$$
- $\phi(s_T)$ is the constraint on the final condition
- $f(s_k, a_k)$ is the physical model
- $S_{\min/\max}$, $A_{\min/\max}$ are the constraints on the state space and on the actions, respectively
- $s_0$, $s_T$ are the initial and final conditions
- The collision model is embedded within the state and the final condition
21 The NN Controller
- Input: the physical state vector of the game, i.e. positions and velocities, $s_t \in \mathbb{R}^{8}$
- Output: 25 Q-values (5 actions in each axis)
- Network: feedforward neural network with 4 layers
- Activations: ReLU
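A forward pass through a network of this shape can be sketched in plain Python as follows. The hidden width of 64, the reading of "4 layers" as four weight layers, the linear output layer, and the random initialization are assumptions; the slides give only the input size, the output size, the layer count, and the activation.

```python
import random


def make_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def forward(layers, x):
    """Feedforward pass with ReLU on every layer except the output."""
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
sizes = [8, 64, 64, 64, 25]  # 8 state inputs, 25 Q-value outputs
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
q_values = forward(layers, [0.0] * 8)
greedy = max(range(len(q_values)), key=lambda i: q_values[i])
print(len(q_values), divmod(greedy, 5))  # output count, (x-level, y-level) indices
```

Each of the 25 outputs corresponds to one pair of per-axis acceleration levels, so `divmod(index, 5)` recovers the two level indices under a row-major ordering.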
23 Reward Definition
Reward function:
$$r_t = \begin{cases} r_{\text{time}} & \text{if } s_t \text{ is not terminal} \\ r_c + r_v + r_d & \text{if } s_t \text{ is terminal} \end{cases}$$
- $r_c$ is a reward given for striking the puck
- $r_v$ is a reward for maximal velocity in the direction of the goal: $r_v = \operatorname{sign}(V) \cdot V^{2}$
- $r_d$ is a reward for the puck reaching the desired point at the opponent's goal:
$$r_d = \begin{cases} c & |x - x_g| \le w \\ c \, e^{-d(|x - x_g| - w)} & |x - x_g| > w \end{cases}$$
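The terminal reward can be sketched as a function, assuming that $r_d$ equals $c$ inside a window of half-width $w$ around the target point $x_g$ and decays exponentially with rate $d$ outside it. All constant values below are placeholders chosen for illustration, not values from the slides.

```python
import math


def terminal_reward(v, x, x_g, w, c=1.0, d=1.0, r_c=1.0):
    """r_c + r_v + r_d for a terminal state; all constants are placeholders."""
    # r_v = sign(V) * V^2: rewards speed toward the goal, penalizes speed away.
    r_v = math.copysign(1.0, v) * v ** 2
    miss = abs(x - x_g)
    # Full reward inside the window, exponential decay outside (assumed form).
    r_d = c if miss <= w else c * math.exp(-d * (miss - w))
    return r_c + r_v + r_d


print(terminal_reward(v=2.0, x=0.0, x_g=0.0, w=0.1))  # 1 + 4 + 1 = 6.0
```

Squaring the velocity while keeping its sign makes fast strikes toward the goal disproportionately valuable and fast strikes in the wrong direction disproportionately costly.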
24 Double DQN
Applying the DQN scheme to the air hockey problem (testing different update periods).
[Figure panels: (a) Left, (b) Middle, (c) Right, (d) Random, (e) Estimated value]
26 Challenges
Observations:
- DQN takes a long time to start to rise (no feature learning)
- The best policy learned is suboptimal
- Sharp drops in score values, resulting in an oscillating policy
- The average value is noisy and oscillating
- The learning (TD) error diverges
- DQN is very inconsistent
Possible explanations:
- Continuous state space
- Physical model dynamics
- Lack of rewards
28 Exploration/Exploitation in Continuous Domains
ε-greedy:
$$a_t = \begin{cases} a_t^{*} & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$
Local exploration:
$$a_t = a_t^{*} + N_t$$
where $a_t^{*}$ is the greedy action and $N_t$ is a temporally correlated random process.
- ε-greedy exploration gets filtered out in physical systems with inertia, because the system acts as a low-pass filter.
- Local exploration needs to be around the optimum to work.
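Both exploration schemes can be sketched in a few lines. The slides do not say which temporally correlated process is used for $N_t$; the discrete Ornstein-Uhlenbeck recursion below is a common choice and is an assumption here, as are the parameter values.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Greedy action index with probability 1 - epsilon, else uniform random."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def ou_noise(n_steps, rng, theta=0.15, sigma=0.2):
    """Correlated noise: N_{t+1} = (1 - theta) * N_t + sigma * xi_t."""
    n, out = 0.0, []
    for _ in range(n_steps):
        # Each sample keeps most of the previous one, so the noise drifts
        # smoothly instead of jumping independently at every step.
        n = (1.0 - theta) * n + sigma * rng.gauss(0.0, 1.0)
        out.append(n)
    return out


rng = random.Random(0)
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))  # 1
print(ou_noise(3, rng))
```

Independent random actions average out quickly in a system with inertia, whereas a correlated process pushes in a similar direction for several consecutive steps and so produces a visible change in the trajectory.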
30 Learning Guidance with On-line Demonstrations
Combine prior knowledge with present experience in a one-stage learning mechanism:
1. With probability $\epsilon_p$, instruct the agent to act according to the guiding policy $\pi_g(s)$ for a full episode
2. Store the transitions in the replay buffer for learning
Acting scheme:
- With probability $\epsilon_p$, perform a guided episode: $a_t = \pi_g(s_t)$
- With probability $1 - \epsilon_p$, act according to the exploration/exploitation (E/E) scheme:
$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a) + N_t & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$
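The two-level decision (guided or not per episode, then greedy or random per step) can be sketched as below. The function names and the stand-in guide policy are hypothetical, and the local noise term $N_t$ is left out to keep the sketch short.

```python
import random


def choose_episode_mode(epsilon_p, rng):
    """Drawn once per episode: 'guided' with probability epsilon_p."""
    return "guided" if rng.random() < epsilon_p else "explore"


def select_action(mode, state, q_values, guide_policy, epsilon, rng):
    """Return an action index for the current step."""
    if mode == "guided":
        return guide_policy(state)           # a_t = pi_g(s_t)
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random action
    return max(range(len(q_values)), key=lambda i: q_values[i])


def guide(state):
    """Hypothetical guiding policy: always the last action index."""
    return 24


rng = random.Random(0)
mode = choose_episode_mode(epsilon_p=1.0, rng=rng)
print(mode, select_action(mode, None, [0.0] * 25, guide, 0.1, rng))  # guided 24
```

Transitions from guided and unguided episodes would be stored in the same replay buffer, so the demonstrations are learned from in the same way as the agent's own experience.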
32 Fixed Target Update
Samples in the replay buffer (the dataset) are not stationary.
- Fast updates can follow changes rapidly, but may diverge.
- Slow updates can filter unstable changes, but may miss entire events.
At every update, set
$$C \leftarrow C \cdot C_r, \qquad C_r \ge 1$$
where $C$ is the target-network update period.
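One way to read this rule is that the interval between target-network updates is multiplied by $C_r$ after each update. The sketch below lists the resulting update steps under that reading, assuming $C_r \ge 1$; the function name and the numbers are illustrative only.

```python
def target_update_steps(c0, c_r, total_steps):
    """Steps at which theta_hat <- theta, scaling the period by c_r each time."""
    if c_r < 1:
        raise ValueError("this sketch assumes c_r >= 1")
    steps, period, next_update = [], float(c0), float(c0)
    while next_update <= total_steps:
        steps.append(int(next_update))
        period *= c_r           # C <- C * C_r after every update
        next_update += period
    return steps


print(target_update_steps(c0=100, c_r=2.0, total_steps=2000))  # [100, 300, 700, 1500]
```

With `c_r=1.0` this reduces to the fixed period of the basic DQN scheme, while larger values start with fast updates and slow down as training progresses.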
33 Guided-DQN
GDQN, Double DQN and Deep Deterministic Policy Gradients (DDPG).
[Figure panels: (a) Left, (b) Middle, (c) Right, (d) Random, (e) Estimated value]
34 Control Profile and Trajectory
[Figure panels: (a) Profiles, (b) Trajectory]
35 Learning Simulation
OpenAI Striking Simulation
36 Summary
Conclusion:
- Formalizing an optimal control problem as a learning problem.
- Combining different methods of exploration.
- Injecting prior knowledge into the existing learning-from-experience framework.
- Addressing the problem of non-stationarity in experience.
Future work:
- Extending the setting to a moving puck.
- Extending the learning scheme to continuous actions.
- Implementing the learning algorithm in a physical environment.
37 That's All Folks! Thank You
More informationExponential Moving Average Based Multiagent Reinforcement Learning Algorithms
Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca
More informationFigure 1: Bayes Net. (a) (2 points) List all independence and conditional independence relationships implied by this Bayes net.
1 Bayes Nets Unfortunately during spring due to illness and allergies, Billy is unable to distinguish the cause (X) of his symptoms which could be: coughing (C), sneezing (S), and temperature (T). If he
More informationApproximate Q-Learning. Dan Weld / University of Washington
Approximate Q-Learning Dan Weld / University of Washington [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley materials available at http://ai.berkeley.edu.] Q Learning
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationCMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationDeep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017
Deep Reinforcement Learning STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Outline Introduction to Reinforcement Learning AlphaGo (Deep RL for Computer Go)
More informationThe convergence limit of the temporal difference learning
The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3, 2013 1 Outline Reinforcement Learning Convergence limit Construction of the feature vector
More informationReinforcement Learning II. George Konidaris
Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes
More informationReinforcement Learning
Reinforcement Learning Policy gradients Daniel Hennes 26.06.2017 University Stuttgart - IPVS - Machine Learning & Robotics 1 Policy based reinforcement learning So far we approximated the action value
More informationCSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes
CSL302/612 Artificial Intelligence End-Semester Exam 120 Minutes Name: Roll Number: Please read the following instructions carefully Ø Calculators are allowed. However, laptops or mobile phones are not
More informationCS 188 Introduction to Fall 2007 Artificial Intelligence Midterm
NAME: SID#: Login: Sec: 1 CS 188 Introduction to Fall 2007 Artificial Intelligence Midterm You have 80 minutes. The exam is closed book, closed notes except a one-page crib sheet, basic calculators only.
More informationarxiv: v1 [cs.lg] 20 Jun 2018
RUDDER: Return Decomposition for Delayed Rewards arxiv:1806.07857v1 [cs.lg 20 Jun 2018 Jose A. Arjona-Medina Michael Gillhofer Michael Widrich Thomas Unterthiner Sepp Hochreiter LIT AI Lab Institute for
More informationReinforcement Learning with Function Approximation. Joseph Christian G. Noel
Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is
More informationReinforcement Learning
Reinforcement Learning Policy Search: Actor-Critic and Gradient Policy search Mario Martin CS-UPC May 28, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 28, 2018 / 63 Goal of this lecture So far
More informationReinforcement Learning
Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University
More informationLeast squares policy iteration (LSPI)
Least squares policy iteration (LSPI) Charles Elkan elkan@cs.ucsd.edu December 6, 2012 1 Policy evaluation and policy improvement Let π be a non-deterministic but stationary policy, so p(a s; π) is the
More information15-780: ReinforcementLearning
15-780: ReinforcementLearning J. Zico Kolter March 2, 2016 1 Outline Challenge of RL Model-based methods Model-free methods Exploration and exploitation 2 Outline Challenge of RL Model-based methods Model-free
More informationReinforcement Learning
Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning II Daniel Hennes 11.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Eligibility traces n-step TD returns
More informationPART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.
Advanced Topics in Machine Learning, GI13, 2010/11 Advanced Topics in Machine Learning, GI13, 2010/11 Answer any THREE questions. Each question is worth 20 marks. Use separate answer books Answer any THREE
More informationAn Introduction to Reinforcement Learning
An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement
More informationReinforcement Learning: the basics
Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error
More informationLecture 4: Approximate dynamic programming
IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are
More information