Learning Control for Air Hockey Striking using Deep Reinforcement Learning

1 Learning Control for Air Hockey Striking using Deep Reinforcement Learning
Ayal Taitler, Nahum Shimkin
Faculty of Electrical Engineering, Technion - Israel Institute of Technology
May 8, 2017

2 Overview
1 Introduction: Reinforcement Learning; Q-Learning; Deep Reinforcement Learning
2 The Air Hockey Striking Problem: Goals; The Optimal Control Problem; The RL Formulation; DQN for Air Hockey; Guided-DQN Components; Results
3 Summary

3 The Reinforcement Learning Setup
At each time step t:
1 The agent and environment are in state s_t
2 The agent executes action a_t in the environment
3 The environment transitions to state s_{t+1} according to a_t and s_t
4 The agent observes state s_{t+1} and receives reward r_t
5 Set t = t + 1 and s_t = s_{t+1}
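
As a minimal sketch of this loop, assuming the classic Gym-style interface (reset/step) and a stand-in environment rather than the air hockey simulator:

```python
import gym  # any environment with the classic Gym interface works here

env = gym.make("CartPole-v1")   # stand-in environment, not the air hockey table
s_t = env.reset()
done = False
while not done:
    a_t = env.action_space.sample()            # placeholder policy: act at random
    s_next, r_t, done, info = env.step(a_t)    # environment transition and reward
    s_t = s_next                               # t <- t + 1
```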

4 RL Components
The RL goal is to maximize the expected accumulated reward:

RL Objective:
$$\max_\pi \; \mathbb{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t r_t\Big], \qquad \pi : S \to A$$

The model is defined by the tuple $\langle S, A, P, r \rangle$, where
1 A - action space
2 S - state space, with initial state s_0
3 P - transition model, P(s' | s, a)
4 r - reward function, r_t = r(s_t, a_t)
5 γ - discount factor

5-6 Q-Learning, Watkins (1989)
Q-Learning is a learning version of the value iteration algorithm. At each state s_t, choose the action a_t that maximizes the function Q(s_t, a_t). Q is the estimated state-action value function: it tells us how good an action is in a given state. Formally:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)$$

The state-action value function can also be learned from off-policy samples <s_t, a_t, r_t, s_{t+1}>:

$$Q(s_t, a_t) \leftarrow \underbrace{Q(s_t, a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \Big[\underbrace{r_t + \gamma \max_a Q(s_{t+1}, a)}_{\text{target}} - \underbrace{Q(s_t, a_t)}_{\text{old value}}\Big]$$

Selected action: π(s) = argmax_a Q(s, a), combined with an exploration/exploitation (E/E) mechanism.
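
A minimal tabular sketch of this update rule, assuming small discrete state and action spaces (all sizes and constants here are illustrative):

```python
import numpy as np

n_states, n_actions = 50, 5
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One off-policy Q-learning step on a sample <s, a, r, s'>."""
    target = r + gamma * Q[s_next].max()    # bootstrapped target
    Q[s, a] += alpha * (target - Q[s, a])   # move Q toward the target by the TD error
```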

7-9 Deep Q-Learning
State-action transitions <s, a, r, s'> are sampled either from full episode roll-outs or from a distribution over an external memory, with

$$r = r(s, a), \qquad s' \sim P(\cdot \mid s, a)$$

The state-action value function can be approximated by a neural network (a parametric function approximator):

$$Q^\pi(s, a) \approx Q(s, a; \theta)$$

Define the objective function as the MSE of the temporal-difference error:

$$L(\theta) = \mathbb{E}_{s,a,r,s' \sim P}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta)}_{\text{target}} - \underbrace{Q(s, a; \theta)}_{\text{prediction}}\big)^2\Big]$$
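
A PyTorch-style sketch of this loss; the layer sizes and batch names are assumptions for illustration, not the network used in the talk:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))  # placeholder net
gamma = 0.99

def td_loss(s, a, r, s_next):
    """MSE of the TD error on a batch of transitions <s, a, r, s'>."""
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # prediction Q(s, a; theta)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values  # r + gamma max_a' Q(s', a')
    return ((target - q_pred) ** 2).mean()
```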

10-11 DQN & Double DQN Learning Rule, Van Hasselt et al. (2016)
In order to reduce oscillations and increase stability, we use:
- an experience replay buffer D
- an additional frozen target Q-network Q(s, a; θ̂)

$$L_{\text{DQN}}(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \hat\theta) - Q(s, a; \theta)\big)^2\Big]$$

and periodically update the target parameters θ̂ ← θ.

To reduce value overestimation, we decouple action selection from action evaluation, yielding

$$L_{\text{DDQN}}(\theta) = \mathbb{E}_{s,a,r,s' \sim D}\Big[\big(r + \gamma Q(s', a^*; \hat\theta) - Q(s, a; \theta)\big)^2\Big], \qquad a^* = \operatorname*{argmax}_{a'} Q(s', a'; \theta)$$
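
As a sketch, the Double DQN target differs from the DQN target only in how the next action is chosen (placeholder networks with illustrative sizes):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))  # online net, theta
target_net = copy.deepcopy(q_net)  # frozen target net, theta_hat; synced every C steps

def ddqn_target(r, s_next, gamma=0.99):
    """Double DQN: select a* with the online net, evaluate it with the target net."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax_a' Q(s', a'; theta)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # Q(s', a*; theta_hat)
    return r + gamma * q_eval
```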

12 Deep Q-Network Learning Scheme
Initialize:
1 Experience replay buffer D
2 Online network with weights θ
3 Target network with weights θ̂ ← θ

Start at state s_0 and repeat for M steps:
1 In state s_t execute action a_t = argmax_a Q(s_t, a; θ), or explore
2 Observe reward r_t and new state s_{t+1}
3 Store transition <s_t, a_t, r_t, s_{t+1}> in the experience replay buffer D
4 Sample N transitions uniformly from D
5 Update the Q-network Q(s, a; θ) according to L(θ)
6 Set s_t ← s_{t+1}
7 Update the target Q-network with θ̂ ← θ every C steps
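
A condensed sketch of the whole scheme under stated assumptions: env_step is a hypothetical one-step simulator interface, and all hyperparameters and layer sizes are illustrative rather than the values used in the talk:

```python
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 25))
target_net.load_state_dict(q_net.state_dict())        # initialize theta_hat <- theta
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)
buffer = deque(maxlen=100_000)                        # experience replay buffer D
M, N, C, eps, gamma = 100_000, 32, 1000, 0.1, 0.99    # illustrative hyperparameters

s = torch.zeros(8)                                    # initial state s_0 (placeholder)
for step in range(M):
    # 1. in s_t, act greedily or explore (epsilon-greedy here for brevity)
    a = random.randrange(25) if random.random() < eps else int(q_net(s).argmax())
    # 2. observe r_t and s_{t+1}; env_step is a hypothetical simulator interface
    s_next, r = env_step(s, a)
    # 3. store the transition in D
    buffer.append((s, a, r, s_next))
    if len(buffer) >= N:
        # 4. sample N transitions uniformly from D
        s_b, a_b, r_b, sn_b = zip(*random.sample(buffer, N))
        s_b, sn_b = torch.stack(s_b), torch.stack(sn_b)
        a_b, r_b = torch.tensor(a_b), torch.tensor(r_b)
        # 5. one gradient step on L(theta)
        with torch.no_grad():
            tgt = r_b + gamma * target_net(sn_b).max(dim=1).values
        q = q_net(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
        loss = ((tgt - q) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    s = s_next                                        # 6. s_t <- s_{t+1}
    if (step + 1) % C == 0:
        target_net.load_state_dict(q_net.state_dict())  # 7. theta_hat <- theta
```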

13 The Air Hockey Physical Setup
(Figure: the physical air hockey table setup.)

14-16 Air Hockey Striking Problem Formulation
Goal: find a policy for the agent to strike the puck (a skill), such that the puck moves in a desired attack pattern.

Assumptions:
- The puck is stationary
- The agent has known physical limitations
- The agent's physical model and the puck-mallet collision model are unknown

Requirements:
- Maximum puck velocity after impact
- The puck is aimed at the center of the goal
- The puck's trajectory follows the selected skill

17-19 Agent's Dynamics
State space: the combination of the agent's (m) and puck's (p) positions and velocities,

$$s_t = [m_x, m_{V_x}, m_y, m_{V_y}, p_x, p_{V_x}, p_y, p_{V_y}]^T$$

Action space: acceleration commands (motor torques) $[a_x, a_y]^T$ with $|a_{x,y}| \le A_{\max}$, discretized in order to fit the Q-network scheme (sketched below). The boundary and zero actions are included (among others) so that the known bang-zero-bang profile is representable:

$$A = \{-A_{\max}, -A_{\max}/2, 0, A_{\max}/2, A_{\max}\}^2$$

Simulation dynamical model: 2-D second-order kinematics (double integrator) with constraints. Collision models are ideal, with restitution coefficient e = 0.99.
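
A small sketch of this discretization (names illustrative): five acceleration levels per axis yield the 25 discrete actions that the Q-network's outputs index.

```python
import itertools
import numpy as np

A_MAX = 1.0  # illustrative bound; the real limit comes from the motor constraints
levels = np.array([-A_MAX, -A_MAX / 2, 0.0, A_MAX / 2, A_MAX])

# Cartesian product over the two axes: 5 x 5 = 25 actions (a_x, a_y),
# including the boundary and zero actions needed for bang-zero-bang profiles
ACTIONS = list(itertools.product(levels, levels))

def action_from_index(idx):
    """Map a Q-network output index (0..24) to an acceleration command."""
    return ACTIONS[idx]
```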

20 Time Optimal Control Formulation
Control problem:

$$\begin{aligned}
\underset{a_k}{\text{minimize}} \quad & \phi(s_T) + \sum_{t=1}^{T-1} 1 \\
\text{subject to} \quad & s_{k+1} = f(s_k, a_k) \\
& s_k^{(i)} \in [S_{\min}^{(i)}, S_{\max}^{(i)}], \quad i = 1, \dots, 8 \\
& a_k^{(j)} \in [A_{\min}^{(j)}, A_{\max}^{(j)}], \quad j = 1, 2 \\
& s_0 = s(0)
\end{aligned}$$

where
- φ(s_T) is the constraint on the final condition
- f(s_k, a_k) is the physical model
- S_min/max and A_min/max are the constraints on the state space and actions respectively
- s_0, s_T are the initial and final conditions
- the collision model is embedded within the state and the final condition

21 The NN Controller
- Input: the physical state vector of the game, i.e. positions and velocities, s_t ∈ R^8
- Output: 25 Q-values (5 actions in each axis)
- Network: feedforward neural network with 4 layers
- Activations: ReLU
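
A hedged PyTorch sketch of such a controller; the slide fixes only the input size, output size, depth, and activation, so the hidden widths below are assumptions:

```python
import torch.nn as nn

# 8-dim state in, 25 Q-values out, 4 layers, ReLU activations
controller = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 25),
)
```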

22-23 Reward Definition
Reward function:

$$r_t = \begin{cases} r_{\text{time}} & \text{if } s_t \text{ is not terminal} \\ r_c + r_v + r_d & \text{if } s_t \text{ is terminal} \end{cases}$$

- r_c is a reward given for striking the puck
- r_v is a reward for maximal velocity in the direction of the goal: $r_v = \operatorname{sign}(V)\,V^2$
- r_d is a reward for the puck reaching the desired point at the opponent's goal (a sketch follows):

$$r_d = \begin{cases} c & |x - x_g| \le w \\ d\, e^{-(|x - x_g| - w)/c} & |x - x_g| > w \end{cases}$$
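
A sketch of this shaped reward; the decay outside the aiming window follows the reconstructed formula above, and all numeric values are illustrative placeholders:

```python
import numpy as np

R_TIME, R_C, C, D, W, X_G = -0.01, 10.0, 1.0, 5.0, 0.05, 0.0  # illustrative constants

def reward(terminal, struck, v_goal, x_final):
    """Per-step reward: time penalty while running, r_c + r_v + r_d at the end."""
    if not terminal:
        return R_TIME
    r_c = R_C if struck else 0.0                  # bonus for striking the puck
    r_v = np.sign(v_goal) * v_goal ** 2           # speed in the direction of the goal
    dist = abs(x_final - X_G)                     # distance from the goal center
    r_d = C if dist <= W else D * np.exp(-(dist - W) / C)
    return r_c + r_v + r_d
```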

24 Double DQN
Applying the DQN scheme to the air hockey problem (testing different target-update periods).
(Figure panels: (a) Left, (b) Middle, (c) Right, (d) Random, (e) Estimated value.)

25-26 Challenges
Observations:
- DQN takes a long time to start rising (no feature learning)
- The best policy learned is suboptimal
- Sharp drops in score values, producing an oscillating policy
- The average value is noisy and oscillating
- The learning (TD) error diverges
- DQN is very inconsistent

Possible explanations:
- Continuous state space
- Physical model dynamics
- Lack of rewards

27-28 Exploration/Exploitation in Continuous Domains
ε-greedy:

$$a_t = \begin{cases} a_t^* & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$

Local exploration:

$$a_t = a_t^* + N_t$$

where N_t is a temporally correlated random process.

- ε-greedy noise gets filtered out in physical systems with inertia; the system functions as a low-pass filter
- Local exploration needs to be around the optimum in order to work (see the sketch below)
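
A sketch of both schemes; the Ornstein-Uhlenbeck process below is one common choice of temporally correlated noise and is an assumption here, since the slide does not name the process:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(a_greedy, eps=0.1, n_actions=25):
    """Uncorrelated exploration: a random action with probability eps."""
    return rng.integers(n_actions) if rng.random() < eps else a_greedy

class OUNoise:
    """Temporally correlated noise N_t for local exploration: a_t = a_t* + N_t."""
    def __init__(self, dim=2, theta=0.15, sigma=0.2):
        self.theta, self.sigma = theta, sigma
        self.n = np.zeros(dim)

    def sample(self):
        self.n += -self.theta * self.n + self.sigma * rng.standard_normal(self.n.shape)
        return self.n
```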

29-30 Learning Guidance with On-line Demonstrations
Combine prior knowledge with the existing experience mechanism for one-stage learning:
1 With probability ε_p, instruct the agent to act according to a guiding policy π_g(s) for a full episode
2 Store the transitions in the replay buffer for learning

Acting scheme (see the sketch below):
- with probability ε_p, perform a guided episode: a_t = π_g(s_t)
- with probability 1 − ε_p, act according to the E/E scheme:

$$a_t = \begin{cases} \operatorname*{argmax}_a Q(s_t, a) + N_t & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$
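
Sketching the combined acting scheme: pi_g (the guiding policy, e.g. from the optimal control solution) and the noise process are assumed given, and the noisy greedy command is snapped back onto the discrete action grid, which is one plausible reading of argmax Q + N_t in a discrete action space:

```python
import numpy as np

rng = np.random.default_rng(1)
EPS_P, EPS = 0.2, 0.1  # guidance and exploration probabilities (illustrative)

def nearest_action(a_cont, actions):
    """Snap a continuous command back onto the discrete action grid."""
    return int(np.argmin([np.linalg.norm(a_cont - np.asarray(a)) for a in actions]))

def act(s, q_values, actions, noise, pi_g, guided):
    """One action under the guided exploration/exploitation scheme."""
    if guided:                                   # full guided episode, chosen w.p. EPS_P
        return pi_g(s)
    if rng.random() < EPS:                       # uncorrelated random exploration
        return rng.integers(len(actions))
    a_greedy = np.asarray(actions[int(np.argmax(q_values))])
    return nearest_action(a_greedy + noise.sample(), actions)  # argmax Q + N_t
```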

31-32 Fixed Target Update
Samples in the replay buffer (the dataset) are not stationary:
- fast target updates can follow changes rapidly, but may diverge
- slow target updates can filter out unstable changes, but may miss entire events

Therefore, after every target update set

$$C \leftarrow C \cdot C_r, \qquad C_r \ge 1$$

where C is the target-update period (a sketch follows).
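
A one-function sketch of this schedule inside the training loop; a growth factor slightly above 1 stretches the target-update period over time (values illustrative):

```python
C, C_r = 500, 1.02          # initial update period and growth factor (illustrative)
next_sync = C

def maybe_sync_target(step, sync_fn):
    """Sync theta_hat <- theta every C steps, then stretch C by C_r >= 1."""
    global C, next_sync
    if step >= next_sync:
        sync_fn()            # e.g. target_net.load_state_dict(q_net.state_dict())
        C = int(C * C_r)     # C <- C * C_r
        next_sync = step + C
```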

33 Guided-DQN
GDQN compared with Double DQN and Deep Deterministic Policy Gradient (DDPG).
(Figure panels: (a) Left, (b) Middle, (c) Right, (d) Random, (e) Estimated value.)

34 Control Profile and Trajectory
(Figure panels: (a) Profiles, (b) Trajectory.)

35 Learning Simulation
OpenAI striking simulation.

36 Summary
Conclusions:
- Formalizing an optimal control problem as a learning problem
- Combining different methods of exploration
- Injecting prior knowledge into the existing learning-from-experience framework
- Addressing the problem of non-stationarity in experience

Future work:
- Extending the setting to a moving puck
- Extending the learning scheme to continuous actions
- Implementing the learning algorithm in a physical environment

37 That's All Folks!
Thank You
