Counterfactual Multi-Agent Policy Gradients

Size: px

Start display at page:

Download "Counterfactual Multi-Agent Policy Gradients"

Emery Carter
6 years ago
Views:

1 Counterfactual Multi-Agent Policy Gradients Shimon Whiteson Dept. of Computer Science University of Oxford joint work with Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, and Nantas Nardelli July 6, 2017 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

2 Single-Agent Paradigm Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

3 Multi-Agent Paradigm Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

4 Multi-Agent Systems are Everywhere Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

5 Types of Multi-Agent Systems Cooperative: Shared team reward Coordination problem Competitive: Zero-sum games Individual opposing rewards Minimax equilibria Mixed: General-sum games Nash equilibria What is the question? Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

6 Coordination Problems are Everywhere Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

7 Multi-Agent MDP All agents see the global state s Individual actions: u a U State transitions: P(s s, u) : S U S [0, 1] Shared team reward: r(s, u) : S U R Equivalent to an MDP with a factored action space Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

8 Dec-POMDP Observation function: O(s, a) : S A Z Action-observation history: τ a T (Z U) Decentralised policies: π a (u a τ a ) : T U [0, 1] Natural decentralisation: communication and sensory constraints Artificial decentralisation: coping with joint action space Centralised learning of decentralised policies Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

9 Key Challenges Curse of dimensionality in actions Multi-agent credit assignment Modelling other agents information state Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

10 Single-Agent Policy Gradient Methods Optimise π θ with gradient ascent on expected return: J θ = E s ρ π (s),u π θ (s, ) [r(s, u)] Good when: Greedification is hard, e.g., continuous actions Policy is simpler than value function Policy gradient theorem [Sutton et al. 2000]: θ J θ = E s ρ π (s),u π θ (s, ) [ θ log π θ (u s)q π (s, u)] REINFORCE [Williams 1992]: T θ J θ g(τ) = θ log π θ (u t s t )R t t=0 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

11 Single-Agent Actor-Critic Methods [Sutton et al. 00] Reduce variance in g (τ ) by learning a critic Q(s, u): g (τ ) = T X θ log πθ (ut st )Q(st, ut ) t=0 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

12 Single-Agent Baselines Further reduce variance with a baseline b(s): T g(τ) = θ log π θ (u t s t )(Q(s t, u t ) b(s t )) t=0 b(s) = V (s) = Q(s, u) b(s) = A(s, a), the advantage function: T g(τ) = θ log π θ (u t s t )A(s t, u t ) t=0 TD-error r t + γv (s t+1 ) V (s) is an unbiased estimate of A(s t, u t ): T g(τ) = θ log π θ (u t s t )(r t + γv (s t+1 ) V (s t )) t=0 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

13 Single-Agent Deep Actor-Critic Methods Actor and critic are both deep neural networks Convolutional and recurrent layers Actor and critic share layers Both trained with stochastic gradient descent Actor trained on policy gradient Critic trained on TD(λ) or Sarsa(λ): L t (ψ) = (y (λ) C( t, ψ)) 2 y (λ) = (1 λ) G (n) t = n=1 λ n 1 G (n) t n γ k 1 r t+k + γ n C( t+n, ψ) k=1 Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

14 Independent Actor-Critic Inspired by independent Q-learning [Tan 1993] Each agent learns independently with its own actor and critic Treats other agents as part of the environment Speed learning with parameter sharing Different inputs, including a, induce different behaviour Still independent: critics condition only on τ a and u a Variants: IAC-V: TD-error gradient using V (τ a ) IAC-Q: Advantage-based gradient using A(τ a, u a ) = Q(τ a, u a ) V (τ a ) Limitations: Nonstationary learning Hard to learn to coordinate Multi-agent credit assignment Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

15 Counterfactual Multi-Agent Policy Gradients Centralised critic: stabilise learning to coordinate Counterfactual baseline: tackle multi-agent credit assignment Efficient critic representation: scale to large NNs Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

16 Centralised Critic Centralisation Hard greedification actor-critic T g a (τ) = θ log π θ (ut a τt a )(r t + γv (s t+1 ) V (s t )) t=0 A 1 t (h 1, ) critic (h 2, ) A 2 t h 1 actor 1 u 1 t u 2 t actor 2 h 2 o 1 t u 1 t s t r t o 2 u 2 t t environment Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

17 Wonderful Life Utility [Wolpert & Tumer 2000] Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

18 Difference Rewards [Tumer & Agogino 2007] Per-agent shaped reward: where c a is a default action D a (s, u) = r(s, u) r(s, (u a, c a )) Important property: D a (s, (u a, u a )) > D a (s, u) = r(s, (u a, u a )) > r(s, (u a, a)) Limitations: Need (extra) simulation to estimate counterfactual r(s, (u a, c a )) Need expertise to choose c a Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

19 Counterfactual Baseline Use Q(s, u) to estimate difference rewards: T g a (τ) = θ log π θ (ut a τt a )A a (s t, u t ) t=0 A a (s, u) = Q(s, u) u a π a (u a τ a )Q(s, (u a, u a )) Baseline marginalises out u a Critic obviates need for extra simulations Marginalised action obviates need for default Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

20 Efficient Critic Representation h a t, ) A a t (u a t, a t ) COMA h a t U (h a t ) {Q(u a =1, u -a t,..),..,q(ua = U, u -a t,..)}, u a t-1 ) (u-a t, s t, oa t, a, u t-1 ) ) (c) Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

21 Starcraft Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

22 Starcraft Micromanagement [Synnaeve et al. 2016] Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

23 Centralised Performance Local Field of View (FoV) map heur. IAC-V IAC-Q cnt-v cnt-qv COMA mean best Full FoV, Central Control heur. DQN GMEZO 3m m w d 3z Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

24 Decentralised Starcraft Micromanagement x Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

25 Heuristic Performance Local Field of View (FoV) map heur. IAC-V IAC-Q cnt-v cnt-qv COMA mean best Full FoV, Central Control heur. DQN GMEZO 3m m w d 3z Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

26 Baseline Algorithms IAC-V: independent actor-critic with V (τ a ) IAC-Q: independent actor-critic with A(τ a, u a ) = Q(τ a, u a ) V (τ a ) Central-V: centralised critic V (s) with TD-error-based gradient Central-QV: Centralised critics Q(s, u) and V (s) Advantage gradient A(s, u) = Q(s, u) V (s) COMA but with b(s) = V (s) Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

27 Results (3m, 5m, 5w, 2d-3z) Average Win % Average Win % COMA central-v central-qv heuristic 20k 40k 60k 80k 100k120k140k # Episodes 5k 10k 15k 20k 25k 30k 35k # Episodes Average Win % Average Win % IAC-V IAC-Q 10k 20k 30k 40k 50k 60k 70k # Episodes 5k 10k 15k 20k 25k 30k 35k 40k # Episodes Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

28 Compared to Centralised Controllers Local Field of View (FoV) map heur. IAC-V IAC-Q cnt-v cnt-qv COMA mean best Full FoV, Central Control heur. DQN GMEZO 3m m w d 3z Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

29 Future Work Factored centralised critics for many agents Multi-agent exploration Starcraft macromanagement Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

30 Paper Counterfactual Multi-Agent Policy Gradients Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

31 Thank You Microsoft! This work was made possible thanks to a generous donation of Azure cloud credits from Microsoft. Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, / 31

Multiagent (Deep) Reinforcement Learning

Multiagent (Deep) Reinforcement Learning MARTIN PILÁT (MARTIN.PILAT@MFF.CUNI.CZ) Reinforcement learning The agent needs to learn to perform tasks in environment No prior knowledge about the effects of