Multiagent Learning. Foundations and Recent Trends. Stefano Albrecht and Peter Stone


1 Multiagent Learning Foundations and Recent Trends Stefano Albrecht and Peter Stone Tutorial at IJCAI 2017 conference:

2 Overview Introduction Multiagent Models & Assumptions Learning Goals Learning Algorithms Recent Trends S. Albrecht, P. Stone 1

3 Multiagent Systems Multiple agents interact in common environment Each agent with own sensors, effectors, goals,... Agents have to coordinate actions to achieve goals [Diagram: two agents, each with goals, actions and domain knowledge, sensing and acting on a shared environment] S. Albrecht, P. Stone 2

4 Multiagent Systems Environment defined by: state space, available actions, effects of actions on states, what agents can observe Agents defined by: domain knowledge, goal specification, policies for selecting actions S. Albrecht, P. Stone 3

5 Multiagent Systems Environment defined by: state space, available actions, effects of actions on states, what agents can observe Agents defined by: domain knowledge, goal specification, policies for selecting actions Many problems can be modelled as multiagent systems! S. Albrecht, P. Stone 3

6 Multiagent Systems Applications Chess Poker Starcraft S. Albrecht, P. Stone 4

7 Multiagent Systems Applications Chess Poker Starcraft Robot soccer Home assistance Autonomous cars S. Albrecht, P. Stone 4

8 Multiagent Systems Applications Negotiation Wireless networks Smart grid S. Albrecht, P. Stone 5

9 Multiagent Systems Applications Negotiation Wireless networks Smart grid User interfaces Multi-robot rescue S. Albrecht, P. Stone 5

10 Multiagent Learning Multiagent learning Learning is process of improving performance via experience Can agents learn to coordinate actions with other agents? What to learn? S. Albrecht, P. Stone 6

11 Multiagent Learning Multiagent learning Learning is process of improving performance via experience Can agents learn to coordinate actions with other agents? What to learn? How to select own actions S. Albrecht, P. Stone 6

12 Multiagent Learning Multiagent learning Learning is process of improving performance via experience Can agents learn to coordinate actions with other agents? What to learn? How to select own actions How other agents select actions S. Albrecht, P. Stone 6

13 Multiagent Learning Multiagent learning Learning is process of improving performance via experience Can agents learn to coordinate actions with other agents? What to learn? How to select own actions How other agents select actions Other agents goals, plans, beliefs,... S. Albrecht, P. Stone 6

14 Multiagent Learning Why learning? Domain too complex to solve by hand or with multiagent planning e.g. computing equilibrium solutions in games S. Albrecht, P. Stone 7

15 Multiagent Learning Why learning? Domain too complex to solve by hand or with multiagent planning e.g. computing equilibrium solutions in games Elements of domain unknown e.g. observation probabilities, behaviours of other agents,... Multiagent planning requires complete model S. Albrecht, P. Stone 7

16 Multiagent Learning Why learning? Domain too complex to solve by hand or with multiagent planning e.g. computing equilibrium solutions in games Elements of domain unknown e.g. observation probabilities, behaviours of other agents,... Multiagent planning requires complete model Other agents may learn too Have to adapt continually! Moving target problem central issue in multiagent learning S. Albrecht, P. Stone 7

17 Research in Multiagent Learning Multiagent learning studied in different communities AI, game theory, robotics, psychology,... S. Albrecht, P. Stone 8

18 Research in Multiagent Learning Multiagent learning studied in different communities AI, game theory, robotics, psychology,... Some conferences & journals: AAMAS, AAAI, IJCAI, NIPS, UAI, ICML, ICRA, IROS, RSS, PRIMA, JAAMAS, AIJ, JAIR, MLJ, JMLR,... Very large + growing body of work! S. Albrecht, P. Stone 8

19 Research in Multiagent Learning Multiagent learning studied in different communities AI, game theory, robotics, psychology,... Some conferences & journals: AAMAS, AAAI, IJCAI, NIPS, UAI, ICML, ICRA, IROS, RSS, PRIMA, JAAMAS, AIJ, JAIR, MLJ, JMLR,... Very large + growing body of work! Many algorithms proposed to address different assumptions (constraints), learning goals, performance criteria,... S. Albrecht, P. Stone 8

20 Tutorial This tutorial: Introduction to basics of multiagent learning: Interaction models & assumptions Learning goals Selection of learning algorithms Plus some recent trends S. Albrecht, P. Stone 9

21 Tutorial This tutorial: Introduction to basics of multiagent learning: Interaction models & assumptions Learning goals Selection of learning algorithms Plus some recent trends Further reading: AIJ Special Issue Foundations of Multi-Agent Learning Rakesh Vohra, Michael Wellman (eds.), 2007 Surveys: Tuyls and Weiss (2012); Busoniu et al. (2008); Panait and Luke (2005); Shoham et al. (2003); Alonso et al. (2001); Stone and Veloso (2000); Sen and Weiss (1999) Our own upcoming survey on agents modelling other agents! S. Albrecht, P. Stone 9

22 Overview Introduction Multiagent Models & Assumptions Learning Goals Learning Algorithms Recent Trends S. Albrecht, P. Stone 10

23 Overview Introduction Multiagent Models & Assumptions Learning Goals Learning Algorithms Recent Trends S. Albrecht, P. Stone 11

24 Multiagent Models & Assumptions Standard multiagent models: Normal-form game Repeated game Stochastic game Assumptions and other models S. Albrecht, P. Stone 12

25 Normal-Form Game Normal-form game consists of: Finite set of agents N = {1,..., n} For each agent i ∈ N: Finite set of actions A_i Utility function u_i : A → R, where A = A_1 × ... × A_n S. Albrecht, P. Stone 13

26 Normal-Form Game Normal-form game consists of: Finite set of agents N = {1,..., n} For each agent i ∈ N: Finite set of actions A_i Utility function u_i : A → R, where A = A_1 × ... × A_n Each agent i selects policy π_i : A_i → [0, 1], takes action a_i ∈ A_i with probability π_i(a_i), and receives utility u_i(a_1,..., a_n) Given policy profile (π_1,..., π_n), expected utility to i is U_i(π_1,..., π_n) = Σ_{a ∈ A} π_1(a_1) ... π_n(a_n) u_i(a) Agents want to maximise their expected utilities S. Albrecht, P. Stone 13
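
To make the expected-utility formula concrete, here is a minimal Python sketch (not from the tutorial) that enumerates joint actions of a two-agent normal-form game; the payoff dictionary uses the Prisoner's Dilemma from the next slide, and all names are illustrative.

```python
import itertools

def expected_utility(i, policies, utilities):
    """Expected utility U_i for a policy profile in a normal-form game.

    policies[j] maps each action of agent j to a probability;
    utilities[joint_action] is the tuple of utilities, one entry per agent.
    """
    total = 0.0
    action_sets = [list(p.keys()) for p in policies]
    for joint in itertools.product(*action_sets):
        prob = 1.0
        for j, a_j in enumerate(joint):
            prob *= policies[j][a_j]          # product of individual action probabilities
        total += prob * utilities[joint][i]   # weight agent i's utility by the joint probability
    return total

# Prisoner's Dilemma payoffs (row agent, column agent)
pd = {("C", "C"): (-1, -1), ("C", "D"): (-5, 0),
      ("D", "C"): (0, -5), ("D", "D"): (-3, -3)}
uniform = {"C": 0.5, "D": 0.5}
print(expected_utility(0, [uniform, uniform], pd))   # -> -2.25
```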

27 Normal-Form Game: Prisoner's Dilemma Example: Prisoner's Dilemma Two prisoners questioned in isolated cells Each prisoner can Cooperate or Defect Utilities (row = agent 1, column = agent 2): (C,C) = -1,-1; (C,D) = -5,0; (D,C) = 0,-5; (D,D) = -3,-3 S. Albrecht, P. Stone 14

28 Normal-Form Game: Chicken Example: Chicken Two opposite drivers on same lane Each driver can Stay on lane or Leave lane Utilities: (S,S) = 0,0; (S,L) = 7,2; (L,S) = 2,7; (L,L) = 6,6 S. Albrecht, P. Stone 15

29 Normal-Form Game: Rock-Paper-Scissors Example: Rock-Paper-Scissors Two players, three actions Rock beats Scissors, Scissors beats Paper, Paper beats Rock Utilities: (R,R) = 0,0; (R,P) = -1,1; (R,S) = 1,-1; (P,R) = 1,-1; (P,P) = 0,0; (P,S) = -1,1; (S,R) = -1,1; (S,P) = 1,-1; (S,S) = 0,0 S. Albrecht, P. Stone 16

30 Repeated Game Learning requires experience Normal-form game is single interaction No experience! Experience comes from repeated interactions S. Albrecht, P. Stone 17

31 Repeated Game Learning requires experience Normal-form game is single interaction No experience! Experience comes from repeated interactions Repeated game: Repeat same normal-form game: at each time t, each agent i chooses action a^t_i and gets utility u_i(a^t_1,..., a^t_n) Policy π_i : H × A_i → [0, 1] assigns action probabilities based on history of interaction, H = ∪_{t ∈ N_0} H^t, H^t = { H^t = (a^0, a^1,..., a^{t-1}) | a^τ ∈ A } S. Albrecht, P. Stone 17

32 Repeated Game What is expected utility to i for policy profile (π_1,..., π_n)? Repeating game t ∈ N times: U_i(π_1,..., π_n) = Σ_{H^t ∈ H^t} P(H^t | π_1,..., π_n) Σ_{τ=0}^{t-1} u_i(a^τ), where P(H^t | π_1,..., π_n) = Π_{τ=0}^{t-1} Π_{j ∈ N} π_j(H^τ, a^τ_j) S. Albrecht, P. Stone 18

33 Repeated Game What is expected utility to i for policy profile (π_1,..., π_n)? Repeating game infinitely many times: U_i(π_1,..., π_n) = lim_{t→∞} Σ_{H^t} P(H^t | π_1,..., π_n) Σ_τ γ^τ u_i(a^τ) Discount factor 0 ≤ γ < 1 makes expectation finite Interpretation: low γ is myopic, high γ is farsighted (Or: probability that game will end at each time is 1 - γ) S. Albrecht, P. Stone 19

34 Repeated Game What is expected utility to i for policy profile (π_1,..., π_n)? Repeating game infinitely many times: U_i(π_1,..., π_n) = lim_{t→∞} Σ_{H^t} P(H^t | π_1,..., π_n) Σ_τ γ^τ u_i(a^τ) Discount factor 0 ≤ γ < 1 makes expectation finite Interpretation: low γ is myopic, high γ is farsighted (Or: probability that game will end at each time is 1 - γ) Can also define expected utility as limit average S. Albrecht, P. Stone 19

35 Repeated Game: Prisoner's Dilemma Example: Repeated Prisoner's Dilemma Utilities: (C,C) = -1,-1; (C,D) = -5,0; (D,C) = 0,-5; (D,D) = -3,-3 Example policies: At time t, choose C with probability (t + 1)^{-1} Grim: choose C until opponent's first D, then choose D forever Tit-for-Tat: begin with C, then repeat opponent's last action S. Albrecht, P. Stone 20
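
A small sketch of these example policies (not from the tutorial), assuming the history is stored as a list of (own action, opponent action) pairs:

```python
import random

def decaying_cooperation(history, t):
    """Choose C with probability 1/(t+1), as in the slide."""
    return "C" if random.random() < 1.0 / (t + 1) else "D"

def grim(history, t):
    """Cooperate until the opponent's first defection, then defect forever."""
    opponent_defected = any(a_opp == "D" for (_, a_opp) in history)
    return "D" if opponent_defected else "C"

def tit_for_tat(history, t):
    """Begin with C, then repeat the opponent's last action."""
    return "C" if not history else history[-1][1]

history = [("C", "C"), ("C", "D")]
print(grim(history, 2), tit_for_tat(history, 2))   # -> D D
```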

36 Repeated Game: Rock-Paper-Scissors Example: Repeated Rock-Paper-Scissors Utilities: (R,R) = 0,0; (R,P) = -1,1; (R,S) = 1,-1; (P,R) = 1,-1; (P,P) = 0,0; (P,S) = -1,1; (S,R) = -1,1; (S,P) = 1,-1; (S,S) = 0,0 Example policy: Compute empirical frequency of opponent actions over past 5 moves, P(a_j) = (1/5) Σ_{τ=t-5}^{t-1} [a^τ_j = a_j]_1, and take best-response action arg max_{a_i} Σ_{a_j} P(a_j) u_i(a_i, a_j) S. Albrecht, P. Stone 21

37 Stochastic Game Agents interact in common environment Environment has states, actions have effect on state Agents choose actions based on state-action history Example: Pursuit (e.g. Barrett et al., 2011) Predator agents must capture prey State: agent positions Actions: move to neighbouring cell S. Albrecht, P. Stone 22

38 Stochastic Game Stochastic game consists of: Finite set of agents N = {1,..., n} Finite set of states S For each agent i ∈ N: Finite set of actions A_i Utility function u_i : S × A → R, where A = A_1 × ... × A_n State transition function T : S × A × S → [0, 1] S. Albrecht, P. Stone 23

39 Stochastic Game Stochastic game consists of: Finite set of agents N = {1,..., n} Finite set of states S For each agent i ∈ N: Finite set of actions A_i Utility function u_i : S × A → R, where A = A_1 × ... × A_n State transition function T : S × A × S → [0, 1] Generalises Markov decision process (MDP) to multiple agents S. Albrecht, P. Stone 23

40 Stochastic Game Game starts in initial state s^0 ∈ S At each time t: Each agent i... S. Albrecht, P. Stone 24

41 Stochastic Game Game starts in initial state s^0 ∈ S At each time t: Each agent i... observes current state s^t and past joint action a^{t-1} (if t > 0) S. Albrecht, P. Stone 24

42 Stochastic Game Game starts in initial state s^0 ∈ S At each time t: Each agent i... observes current state s^t and past joint action a^{t-1} (if t > 0) chooses action a^t_i ∈ A_i with probability π_i(H^t, a^t_i), where H^t = (s^0, a^0, s^1, a^1,..., s^t) is the state-action history S. Albrecht, P. Stone 24

43 Stochastic Game Game starts in initial state s^0 ∈ S At each time t: Each agent i... observes current state s^t and past joint action a^{t-1} (if t > 0) chooses action a^t_i ∈ A_i with probability π_i(H^t, a^t_i), where H^t = (s^0, a^0, s^1, a^1,..., s^t) is the state-action history receives utility u_i(s^t, a^t_1,..., a^t_n) S. Albrecht, P. Stone 24

44 Stochastic Game Game starts in initial state s^0 ∈ S At each time t: Each agent i... observes current state s^t and past joint action a^{t-1} (if t > 0) chooses action a^t_i ∈ A_i with probability π_i(H^t, a^t_i), where H^t = (s^0, a^0, s^1, a^1,..., s^t) is the state-action history receives utility u_i(s^t, a^t_1,..., a^t_n) Game transitions into next state s^{t+1} ∈ S with probability T(s^t, a^t, s^{t+1}) S. Albrecht, P. Stone 24

45 Stochastic Game Game starts in initial state s^0 ∈ S At each time t: Each agent i... observes current state s^t and past joint action a^{t-1} (if t > 0) chooses action a^t_i ∈ A_i with probability π_i(H^t, a^t_i), where H^t = (s^0, a^0, s^1, a^1,..., s^t) is the state-action history receives utility u_i(s^t, a^t_1,..., a^t_n) Game transitions into next state s^{t+1} ∈ S with probability T(s^t, a^t, s^{t+1}) Process repeated finite or infinite number of times, or until terminal state is reached (e.g. prey captured). S. Albrecht, P. Stone 24
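
The interaction loop above can be sketched directly; the following is an illustrative simulation of one episode, with hypothetical callables for policies, utilities and the transition function (all names are assumptions, not the tutorial's code):

```python
import random

def sample(dist):
    """Draw a key of `dist` with probability proportional to its value (dist must be non-empty)."""
    r, acc = random.random(), 0.0
    for x, p in dist.items():
        acc += p
        if r < acc:
            return x
    return x

def run_stochastic_game(s0, policies, utilities, transition, horizon):
    """Simulate one episode of a stochastic game (sketch).

    policies[i](history, s) returns a dict {action: probability} for agent i,
    utilities[i](s, joint_action) returns agent i's utility,
    transition(s, joint_action) returns a dict {next_state: probability}.
    """
    history, s = [], s0
    returns = [0.0] * len(policies)
    for t in range(horizon):
        joint = tuple(sample(pi(history, s)) for pi in policies)   # each agent acts on the history
        for i, u_i in enumerate(utilities):
            returns[i] += u_i(s, joint)                            # accumulate (undiscounted) utilities
        s_next = sample(transition(s, joint))                      # stochastic state transition
        history.append((s, joint))
        s = s_next
    return returns
```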

46 Stochastic Game: Level-Based Foraging Example: Level-Based Foraging (Albrecht and Ramamoorthy, 2013) Agents (circles) must collect all items (squares) State: agent positions, item positions, which items collected Actions: move to neighbouring cell, try to collect item S. Albrecht, P. Stone 25

47 Stochastic Game: Soccer Keepaway Example: Soccer Keepaway (Stone et al., 2005) Keeper agents must keep ball away from Taker agents State: player positions & orientations, ball position,... Actions: go to ball, pass ball to player,... S. Albrecht, P. Stone 26

48 Stochastic Game: Soccer Keepaway Video: 4 vs 3 Keepaway Source: S. Albrecht, P. Stone 27

49 Assumptions Models and algorithms make assumptions, e.g. What do agents know about the game? What can agents observe during a game? S. Albrecht, P. Stone 28

50 Assumptions Models and algorithms make assumptions, e.g. What do agents know about the game? What can agents observe during a game? Usual assumptions: Game elements known (state/action space, utility function,...) Game states & chosen actions are commonly observed full observability or perfect information S. Albrecht, P. Stone 28

51 Assumptions Models and algorithms make assumptions, e.g. What do agents know about the game? What can agents observe during a game? Usual assumptions: Game elements known (state/action space, utility function,...) Game states & chosen actions are commonly observed full observability or perfect information Many learning algorithms designed for repeated/stochastic game with full observability But assumptions may vary and other models exist! S. Albrecht, P. Stone 28

52 Other Models Other assumptions & models: Assumption: elements of game unknown Bayesian game, stochastic Bayesian game AAAI 16 tutorial Type-based Methods for Interaction in Multiagent Systems Assumption: partial observability of states and actions Extensive-form game with imperfect information Partially observable stochastic game (POSG) Multiagent POMDPs: Dec-POMDP, I-POMDP,... S. Albrecht, P. Stone 29

53 Overview Introduction Multiagent Models & Assumptions Learning Goals Learning Algorithms Recent Trends S. Albrecht, P. Stone 30

54 Learning Goals Learning is to improve performance via experience But what is goal (end-result) of learning process? How to measure success of learning? S. Albrecht, P. Stone 31

55 Learning Goals Learning is to improve performance via experience But what is goal (end-result) of learning process? How to measure success of learning? Many learning goals proposed: Minimax/Nash/correlated equilibrium Pareto-optimality Social welfare & fairness No-regret Targeted optimality & safety... plus combinations & approximations S. Albrecht, P. Stone 31

56 Maximin/Minimax Two-player zero-sum game: u_i = -u_j e.g. Rock-Paper-Scissors, Chess S. Albrecht, P. Stone 32

57 Maximin/Minimax Two-player zero-sum game: u_i = -u_j e.g. Rock-Paper-Scissors, Chess Policy profile (π_i, π_j) is maximin/minimax profile if U_i(π_i, π_j) = max_{π'_i} min_{π'_j} U_i(π'_i, π'_j) = min_{π'_j} max_{π'_i} U_i(π'_i, π'_j) = -U_j(π_i, π_j) Utility that can be guaranteed against worst-case opponent S. Albrecht, P. Stone 32

58 Maximin/Minimax Two-player zero-sum game: u_i = -u_j e.g. Rock-Paper-Scissors, Chess Policy profile (π_i, π_j) is maximin/minimax profile if U_i(π_i, π_j) = max_{π'_i} min_{π'_j} U_i(π'_i, π'_j) = min_{π'_j} max_{π'_i} U_i(π'_i, π'_j) = -U_j(π_i, π_j) Utility that can be guaranteed against worst-case opponent Every two-player zero-sum normal-form game has minimax profile (von Neumann and Morgenstern, 1944) Every finite or infinite+discounted zero-sum stochastic game has minimax profile (Shapley, 1953) S. Albrecht, P. Stone 32

59 Nash Equilibrium Policy profile π = (π_1,..., π_n) is Nash equilibrium (NE) if ∀i ∀π'_i : U_i(π'_i, π_{-i}) ≤ U_i(π) No agent can improve utility by unilaterally deviating from profile (every agent plays best-response to other agents) S. Albrecht, P. Stone 33

60 Nash Equilibrium Policy profile π = (π_1,..., π_n) is Nash equilibrium (NE) if ∀i ∀π'_i : U_i(π'_i, π_{-i}) ≤ U_i(π) No agent can improve utility by unilaterally deviating from profile (every agent plays best-response to other agents) Every finite normal-form game has at least one NE (Nash, 1950) (also stochastic games, e.g. Fink (1964)) Standard solution in game theory In two-player zero-sum game, minimax is same as NE S. Albrecht, P. Stone 33

61 Nash Equilibrium Example Example: Prisoner's Dilemma ((C,C) = -1,-1; (C,D) = -5,0; (D,C) = 0,-5; (D,D) = -3,-3) Only NE in normal-form game is (D,D) Normal-form NE are also NE in infinite repeated game Infinite repeated game has many more NE Folk theorem S. Albrecht, P. Stone 34

62 Nash Equilibrium Example Example: Prisoner's Dilemma ((C,C) = -1,-1; (C,D) = -5,0; (D,C) = 0,-5; (D,D) = -3,-3) Only NE in normal-form game is (D,D) Normal-form NE are also NE in infinite repeated game Infinite repeated game has many more NE Folk theorem Example: Rock-Paper-Scissors (utilities as above) Only NE in normal-form game is π_i = π_j = (1/3, 1/3, 1/3) S. Albrecht, P. Stone 34
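
For pure strategies, the "no profitable unilateral deviation" condition can be checked by brute force. The sketch below (illustrative, two-player only) confirms that (D,D) is the unique pure NE of the Prisoner's Dilemma; Rock-Paper-Scissors has no pure NE, its only NE is the mixed one above.

```python
def pure_nash_equilibria(actions1, actions2, u):
    """Enumerate pure-strategy Nash equilibria of a two-player normal-form game.

    u[(a1, a2)] = (utility to player 1, utility to player 2).
    """
    equilibria = []
    for a1 in actions1:
        for a2 in actions2:
            best1 = all(u[(a1, a2)][0] >= u[(b1, a2)][0] for b1 in actions1)  # no profitable deviation for 1
            best2 = all(u[(a1, a2)][1] >= u[(a1, b2)][1] for b2 in actions2)  # no profitable deviation for 2
            if best1 and best2:
                equilibria.append((a1, a2))
    return equilibria

pd = {("C", "C"): (-1, -1), ("C", "D"): (-5, 0),
      ("D", "C"): (0, -5), ("D", "D"): (-3, -3)}
print(pure_nash_equilibria(["C", "D"], ["C", "D"], pd))   # -> [('D', 'D')]
```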

63 Correlated Equilibrium Each agent i observes signal x i with joint distribution ξ(x 1,..., x n ) E.g. x i is action recommendation to agent i S. Albrecht, P. Stone 35

64 Correlated Equilibrium Each agent i observes signal x_i with joint distribution ξ(x_1,..., x_n) E.g. x_i is action recommendation to agent i (π_1,..., π_n) is correlated equilibrium (CE) (Aumann, 1974) if no agent can individually improve its expected utility by deviating from recommended actions NE is subset of CE (no correlation) CE easier to compute than NE (linear program) S. Albrecht, P. Stone 35

65 Correlated Equilibrium Example Example: Chicken ((S,S) = 0,0; (S,L) = 7,2; (L,S) = 2,7; (L,L) = 6,6) Correlated equilibrium: ξ(L, L) = ξ(S, L) = ξ(L, S) = 1/3, ξ(S, S) = 0 Expected utility to both: (1/3)·6 + (1/3)·7 + (1/3)·2 = 5 S. Albrecht, P. Stone 36

66 Correlated Equilibrium Example Example: Chicken ((S,S) = 0,0; (S,L) = 7,2; (L,S) = 2,7; (L,L) = 6,6) Correlated equilibrium: ξ(L, L) = ξ(S, L) = ξ(L, S) = 1/3, ξ(S, S) = 0 Expected utility to both: (1/3)·6 + (1/3)·7 + (1/3)·2 = 5 Nash equilibrium utilities: π_i(S) = 1, π_j(S) = 0 → (7, 2) π_i(S) = 0, π_j(S) = 1 → (2, 7) π_i(S) = 1/3, π_j(S) = 1/3 → (14/3, 14/3) ≈ (4.67, 4.67) S. Albrecht, P. Stone 36
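
The Chicken numbers can be verified with a few lines of Python (illustrative sketch): expected utility under ξ is 5 for both players, and for each recommended action, following the recommendation is at least as good as any deviation, so ξ is indeed a correlated equilibrium.

```python
chicken = {("S", "S"): (0, 0), ("S", "L"): (7, 2),
           ("L", "S"): (2, 7), ("L", "L"): (6, 6)}
xi = {("S", "S"): 0.0, ("S", "L"): 1/3, ("L", "S"): 1/3, ("L", "L"): 1/3}

# Expected utility to each player under the signal distribution xi
for i in (0, 1):
    print(sum(p * chicken[a][i] for a, p in xi.items()))      # -> 5.0 for both

# Obedience check for player 1: following a recommendation must be at least as
# good as any deviation, given the conditional distribution of the other signal.
for rec in ("S", "L"):
    cond = {a2: xi[(rec, a2)] for a2 in ("S", "L")}
    z = sum(cond.values())
    if z == 0:
        continue                                              # recommendation never issued
    follow = sum(p / z * chicken[(rec, a2)][0] for a2, p in cond.items())
    deviate = max(sum(p / z * chicken[(alt, a2)][0] for a2, p in cond.items())
                  for alt in ("S", "L"))
    print(rec, follow >= deviate)                             # -> S True, L True
```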

67 The Equilibrium Legacy The Equilibrium Legacy in multiagent learning: Quickly adopted equilibrium as standard goal of learning But equilibrium (e.g. NE) has many limitations... S. Albrecht, P. Stone 37

68 The Equilibrium Legacy The Equilibrium Legacy in multiagent learning: Quickly adopted equilibrium as standard goal of learning But equilibrium (e.g. NE) has many limitations 1. Non-uniqueness Often multiple NE exist, how should agents choose same one? S. Albrecht, P. Stone 37

69 The Equilibrium Legacy The Equilibrium Legacy in multiagent learning: Quickly adopted equilibrium as standard goal of learning But equilibrium (e.g. NE) has many limitations 1. Non-uniqueness Often multiple NE exist, how should agents choose same one? 2. Incompleteness NE does not specify behaviours for off-equilibrium paths S. Albrecht, P. Stone 37

70 The Equilibrium Legacy The Equilibrium Legacy in multiagent learning: Quickly adopted equilibrium as standard goal of learning But equilibrium (e.g. NE) has many limitations 1. Non-uniqueness Often multiple NE exist, how should agents choose same one? 2. Incompleteness NE does not specify behaviours for off-equilibrium paths 3. Sub-optimality NE not generally same as utility maximisation S. Albrecht, P. Stone 37

71 The Equilibrium Legacy The Equilibrium Legacy in multiagent learning: Quickly adopted equilibrium as standard goal of learning But equilibrium (e.g. NE) has many limitations 1. Non-uniqueness Often multiple NE exist, how should agents choose same one? 2. Incompleteness NE does not specify behaviours for off-equilibrium paths 3. Sub-optimality NE not generally same as utility maximisation 4. Rationality NE assumes all agents are rational (= perfect utility maximisers) S. Albrecht, P. Stone 37

72 Pareto Optimum Policy profile π = (π_1,..., π_n) is Pareto-optimal if there is no other profile π' such that ∀i : U_i(π') ≥ U_i(π) and ∃i : U_i(π') > U_i(π) Can't improve one agent without making other agent worse off S. Albrecht, P. Stone 38

73 Pareto Optimum Policy profile π = (π_1,..., π_n) is Pareto-optimal if there is no other profile π' such that ∀i : U_i(π') ≥ U_i(π) and ∃i : U_i(π') > U_i(π) Can't improve one agent without making other agent worse off [Plot: payoff to player 1 vs payoff to player 2, with payoff points (1,1), (4,1), (1,4), (3,3)] Pareto-front is set of all Pareto-optimal utilities (red line) S. Albrecht, P. Stone 38

74 Social Welfare & Fairness Pareto-optimality says nothing about social welfare and fairness S. Albrecht, P. Stone 39

75 Social Welfare & Fairness Pareto-optimality says nothing about social welfare and fairness Welfare and fairness of profile π = (π_1,..., π_n) often defined as Welfare(π) = Σ_i U_i(π) Fairness(π) = Π_i U_i(π) π is welfare/fairness-optimal if it maximises Welfare(π)/Fairness(π) S. Albrecht, P. Stone 39

76 Social Welfare & Fairness Pareto-optimality says nothing about social welfare and fairness Welfare and fairness of profile π = (π_1,..., π_n) often defined as Welfare(π) = Σ_i U_i(π) Fairness(π) = Π_i U_i(π) π is welfare/fairness-optimal if it maximises Welfare(π)/Fairness(π) Any welfare/fairness-optimal π is also Pareto-optimal! (Why?) S. Albrecht, P. Stone 39

77 No-Regret Given history H^t = (a^0, a^1,..., a^{t-1}), agent i's regret for not having taken action a_i is R_i(a_i | H^t) = Σ_{τ=0}^{t-1} [ u_i(a_i, a^τ_{-i}) - u_i(a^τ_i, a^τ_{-i}) ] S. Albrecht, P. Stone 40

78 No-Regret Given history H^t = (a^0, a^1,..., a^{t-1}), agent i's regret for not having taken action a_i is R_i(a_i | H^t) = Σ_{τ=0}^{t-1} [ u_i(a_i, a^τ_{-i}) - u_i(a^τ_i, a^τ_{-i}) ] Policy π_i achieves no-regret if ∀a_i : lim_{t→∞} (1/t) R_i(a_i | H^t) ≤ 0 (Other variants exist) S. Albrecht, P. Stone 40
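
A sketch of the regret computation (illustrative, not the tutorial's code), which also makes the caveat on the next slides visible: the other agents' past actions are held fixed when the counterfactual action is substituted.

```python
def regret(i, alt_action, history, u):
    """Agent i's regret R_i(a_i | H^t) for not having always played alt_action.

    history is a list of joint actions (tuples); u[joint][i] is agent i's utility.
    """
    total = 0.0
    for joint in history:
        counterfactual = list(joint)
        counterfactual[i] = alt_action              # replace only agent i's action
        total += u[tuple(counterfactual)][i] - u[joint][i]
    return total

pd = {("C", "C"): (-1, -1), ("C", "D"): (-5, 0),
      ("D", "C"): (0, -5), ("D", "D"): (-3, -3)}
history = [("C", "C"), ("C", "D"), ("C", "D")]
print(regret(0, "D", history, pd))                  # 1 + 2 + 2 = 5: defecting would have gained 5
print(regret(0, "D", history, pd) / len(history))   # average regret per step
```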

79 No-Regret Like Nash equilibrium, no-regret widely used in multiagent learning But, like NE, definition of regret has conceptual issues S. Albrecht, P. Stone 41

80 No-Regret Like Nash equilibrium, no-regret widely used in multiagent learning But, like NE, definition of regret has conceptual issues Regret definition assumes other agents don't change actions: R_i(a_i | H^t) = Σ_{τ=0}^{t-1} [ u_i(a_i, a^τ_{-i}) - u_i(a^τ_i, a^τ_{-i}) ] But: entire history may change if different actions taken! S. Albrecht, P. Stone 41

81 No-Regret Like Nash equilibrium, no-regret widely used in multiagent learning But, like NE, definition of regret has conceptual issues Regret definition assumes other agents don't change actions: R_i(a_i | H^t) = Σ_{τ=0}^{t-1} [ u_i(a_i, a^τ_{-i}) - u_i(a^τ_i, a^τ_{-i}) ] But: entire history may change if different actions taken! Thus, minimising regret not generally same as maximising utility (e.g. Crandall, 2014) S. Albrecht, P. Stone 41

82 Targeted Optimality & Safety Many algorithms designed to achieve some version of targeted optimality and safety: S. Albrecht, P. Stone 42

83 Targeted Optimality & Safety Many algorithms designed to achieve some version of targeted optimality and safety: If other agent's policy π_j in certain class, agent i's learning should converge to best-response: U_i(π_i, π_j) → max_{π'_i} U_i(π'_i, π_j) S. Albrecht, P. Stone 42

84 Targeted Optimality & Safety Many algorithms designed to achieve some version of targeted optimality and safety: If other agent's policy π_j in certain class, agent i's learning should converge to best-response: U_i(π_i, π_j) → max_{π'_i} U_i(π'_i, π_j) If not in class, learning should at least achieve safety (maximin) utility: U_i(π_i, π_j) ≥ max_{π'_i} min_{π'_j} U_i(π'_i, π'_j) S. Albrecht, P. Stone 42

85 Targeted Optimality & Safety Many algorithms designed to achieve some version of targeted optimality and safety: If other agent's policy π_j in certain class, agent i's learning should converge to best-response: U_i(π_i, π_j) → max_{π'_i} U_i(π'_i, π_j) If not in class, learning should at least achieve safety (maximin) utility: U_i(π_i, π_j) ≥ max_{π'_i} min_{π'_j} U_i(π'_i, π'_j) Policy classes: non-learning, memory-bounded, finite automata,... S. Albrecht, P. Stone 42

86 Overview Introduction Multiagent Models & Assumptions Learning Goals Learning Algorithms Recent Trends S. Albrecht, P. Stone 43

87 Learning Algorithms The Internal View How does learning take place in policy π_i? The internal view: π_i continually modifies internal policy π̂^t_i based on H^t π̂^t_i has own representation and input format Ĥ^t [Diagram: π_i extracts Ĥ^t from H^t, modifies π̂^{t-1}_i into π̂^t_i, and samples action a^t_i from π̂^t_i(Ĥ^t)] S. Albrecht, P. Stone 44

88 Learning Algorithms The Internal View Internal policy π̂^t_i: Representation: Q-learning, MCTS planner, neural network,... Parameters: Q-table, opponent model, connection weights,... Input format: most recent state/action, abstract feature vector,... [Diagram: π_i extracts Ĥ^t from H^t, modifies π̂^{t-1}_i into π̂^t_i, and samples action a^t_i from π̂^t_i(Ĥ^t)] S. Albrecht, P. Stone 45

89 Fictitious Play Simple example: Fictitious Play (FP) (Brown, 1951) At each time t: S. Albrecht, P. Stone 46

90 Fictitious Play Simple example: Fictitious Play (FP) (Brown, 1951) At each time t: 1. Compute opponent's action frequencies: P(a_j) = (1/(t+1)) Σ_{τ=0}^{t} [a^τ_j = a_j]_1 S. Albrecht, P. Stone 46

91 Fictitious Play Simple example: Fictitious Play (FP) (Brown, 1951) At each time t: 1. Compute opponent's action frequencies: P(a_j) = (1/(t+1)) Σ_{τ=0}^{t} [a^τ_j = a_j]_1 2. Compute best-response action: a^t_i ∈ arg max_{a_i} Σ_{a_j} P(a_j) u_i(a_i, a_j) S. Albrecht, P. Stone 46

92 Fictitious Play Simple example: Fictitious Play (FP) (Brown, 1951) At each time t: 1. Compute opponent's action frequencies: P(a_j) = (1/(t+1)) Σ_{τ=0}^{t} [a^τ_j = a_j]_1 2. Compute best-response action: a^t_i ∈ arg max_{a_i} Σ_{a_j} P(a_j) u_i(a_i, a_j) Self-play: all agents use fictitious play If policies converge, policy profile is Nash equilibrium S. Albrecht, P. Stone 46
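
A minimal Fictitious Play step in Python (illustrative; it uses the plain empirical frequency of the observed opponent actions, which matches the formula above up to the handling of the initial step):

```python
from collections import Counter

def fictitious_play_action(my_actions, opp_actions, opp_history, u_i):
    """Best-respond to the opponent's empirical action frequencies.

    opp_history is the list of opponent actions observed so far; u_i(a_i, a_j) is my utility.
    """
    counts = Counter(opp_history)
    t = len(opp_history)
    freq = {a_j: (counts[a_j] / t if t > 0 else 1.0 / len(opp_actions)) for a_j in opp_actions}
    expected = {a_i: sum(freq[a_j] * u_i(a_i, a_j) for a_j in opp_actions) for a_i in my_actions}
    return max(expected, key=expected.get)   # ties broken arbitrarily

# Rock-Paper-Scissors utility for the row player
beats = {"R": "S", "P": "R", "S": "P"}
u = lambda a, b: 0 if a == b else (1 if beats[a] == b else -1)
print(fictitious_play_action("RPS", "RPS", ["R", "R", "P"], u))   # mostly Rock observed -> 'P'
```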

93 Learning Algorithms Many multiagent learning algorithms exist, e.g. Minimax-Q (Littman, 1994) JAL (Claus and Boutilier, 1998) Regret Matching (Hart and Mas-Colell, 2001, 2000) FFQ (Littman, 2001) WoLF-PHC (Bowling and Veloso, 2002) Nash-Q (Hu and Wellman, 2003) CE-Q (Greenwald and Hall, 2003) OAL (Wang and Sandholm, 2003) ReDVaLeR (Banerjee and Peng, 2004) GIGA-WoLF (Bowling, 2005) CJAL (Banerjee and Sen, 2007) AWESOME (Conitzer and Sandholm, 2007) CMLeS (Chakraborty and Stone, 2014) HBA (Albrecht, Crandall, and Ramamoorthy, 2016) S. Albrecht, P. Stone 47

94 (Conditional) Joint Action Learning Joint Action Learning (JAL) (Claus and Boutilier, 1998) and Conditional Joint Action Learning (CJAL) (Banerjee and Sen, 2007) learn Q-values for joint actions a ∈ A: Q^{t+1}(a^t) = (1 - α) Q^t(a^t) + α u^t_i u^t_i is utility received after joint action a^t α ∈ [0, 1] is learning rate S. Albrecht, P. Stone 48

95 (Conditional) Joint Action Learning Joint Action Learning (JAL) (Claus and Boutilier, 1998) and Conditional Joint Action Learning (CJAL) (Banerjee and Sen, 2007) learn Q-values for joint actions a ∈ A: Q^{t+1}(a^t) = (1 - α) Q^t(a^t) + α u^t_i u^t_i is utility received after joint action a^t α ∈ [0, 1] is learning rate Use opponent model to compute expected utilities of actions: JAL: E(a_i) = Σ_{a_j} P(a_j) Q^{t+1}(a_i, a_j) CJAL: E(a_i) = Σ_{a_j} P(a_j | a_i) Q^{t+1}(a_i, a_j) S. Albrecht, P. Stone 48

96 (Conditional) Joint Action Learning Opponent models estimated from history H^t: JAL: P(a_j) = (1/(t+1)) Σ_{τ=0}^{t} [a^τ_j = a_j]_1 CJAL: P(a_j | a_i) = Σ_{τ=0}^{t} [a^τ_j = a_j, a^τ_i = a_i]_1 / Σ_{τ=0}^{t} [a^τ_i = a_i]_1 S. Albrecht, P. Stone 49

97 (Conditional) Joint Action Learning Opponent models estimated from history H^t: JAL: P(a_j) = (1/(t+1)) Σ_{τ=0}^{t} [a^τ_j = a_j]_1 CJAL: P(a_j | a_i) = Σ_{τ=0}^{t} [a^τ_j = a_j, a^τ_i = a_i]_1 / Σ_{τ=0}^{t} [a^τ_i = a_i]_1 Given expected utilities E(a_i), use some action exploration scheme: E.g. ϵ-greedy: choose arg max_{a_i} E(a_i) with probability 1 - ϵ, else choose random action S. Albrecht, P. Stone 49
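
A compact JAL-style sketch for a repeated two-agent game (illustrative, not the authors' implementation): joint-action Q-values, an empirical opponent model, and ϵ-greedy action selection.

```python
import random
from collections import defaultdict

class JointActionLearner:
    """Minimal JAL sketch for a repeated two-agent game."""

    def __init__(self, my_actions, opp_actions, alpha=0.1, epsilon=0.1):
        self.my_actions, self.opp_actions = my_actions, opp_actions
        self.alpha, self.epsilon = alpha, epsilon
        self.Q = defaultdict(float)                 # Q[(a_i, a_j)] over joint actions
        self.opp_counts = defaultdict(int)          # empirical opponent model P(a_j)

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(self.my_actions)   # epsilon-greedy exploration
        total = sum(self.opp_counts.values())
        expected = {}
        for a_i in self.my_actions:
            expected[a_i] = sum(
                (self.opp_counts[a_j] / total if total else 1 / len(self.opp_actions))
                * self.Q[(a_i, a_j)] for a_j in self.opp_actions)
        return max(expected, key=expected.get)

    def update(self, a_i, a_j, utility):
        self.opp_counts[a_j] += 1
        self.Q[(a_i, a_j)] = (1 - self.alpha) * self.Q[(a_i, a_j)] + self.alpha * utility
```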

98 (Conditional) Joint Action Learning JAL and CJAL can converge to Nash equilibrium in self-play CJAL in variant of Prisoner s Dilemma (from Banerjee and Sen, 2007): Converging to Pareto-optimal NE (C,C) S. Albrecht, P. Stone 50

99 Opponent Modelling FP, JAL, CJAL are simple examples of opponent modelling: Use model of other agent to predict its actions, goals, beliefs,... S. Albrecht, P. Stone 51

100 Opponent Modelling FP, JAL, CJAL are simple examples of opponent modelling: Use model of other agent to predict its actions, goals, beliefs,... Many forms of opponent modelling exist: Policy reconstruction Recursive reasoning Type-based methods Graphical methods Classification Group modelling Plan recognition... S. Albrecht, P. Stone 51

101 Opponent Modelling FP, JAL, CJAL are simple examples of opponent modelling: Use model of other agent to predict its actions, goals, beliefs,... Many forms of opponent modelling exist: Policy reconstruction Recursive reasoning Type-based methods Graphical methods Classification Group modelling Plan recognition... Upcoming survey by S. Albrecht & P. Stone! S. Albrecht, P. Stone 51

102 Minimax/Nash/Correlated Q-Learning Minimax Q-Learning (Minimax-Q) (Littman, 1994) and Nash Q-Learning (Nash-Q) (Hu and Wellman, 2003) and Correlated Q-Learning (CE-Q) (Greenwald and Hall, 2003) learn joint-action Q-values for each agent j ∈ N: Q^{t+1}_j(s^t, a^t) = (1 - α) Q^t_j(s^t, a^t) + α [ u^t_j + γ EQ_j(s^{t+1}) ] S. Albrecht, P. Stone 52

103 Minimax/Nash/Correlated Q-Learning Minimax Q-Learning (Minimax-Q) (Littman, 1994) and Nash Q-Learning (Nash-Q) (Hu and Wellman, 2003) and Correlated Q-Learning (CE-Q) (Greenwald and Hall, 2003) learn joint-action Q-values for each agent j ∈ N: Q^{t+1}_j(s^t, a^t) = (1 - α) Q^t_j(s^t, a^t) + α [ u^t_j + γ EQ_j(s^{t+1}) ] Assumes utilities u^t_j are commonly observed S. Albrecht, P. Stone 52

104 Minimax/Nash/Correlated Q-Learning Minimax Q-Learning (Minimax-Q) (Littman, 1994) and Nash Q-Learning (Nash-Q) (Hu and Wellman, 2003) and Correlated Q-Learning (CE-Q) (Greenwald and Hall, 2003) learn joint-action Q-values for each agent j ∈ N: Q^{t+1}_j(s^t, a^t) = (1 - α) Q^t_j(s^t, a^t) + α [ u^t_j + γ EQ_j(s^{t+1}) ] Assumes utilities u^t_j are commonly observed EQ_j(s^{t+1}) is expected utility to agent j under equilibrium profile for normal-form game with utility functions u_j(a) = Q^t_j(s^{t+1}, a) Minimax-Q: use minimax profile (assumes zero-sum game) Nash-Q: use Nash equilibrium CE-Q: use correlated equilibrium S. Albrecht, P. Stone 52
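
For Minimax-Q, the EQ term for a state is the value of a zero-sum stage game, which can be computed with a linear program. A sketch using scipy (illustrative; this is the standard maximin LP, not code from the cited papers):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(U):
    """Maximin value and policy for the row player of a zero-sum stage game.

    U[i][j] is the row player's utility for (row action i, column action j);
    this is the kind of per-state computation Minimax-Q uses for EQ(s).
    """
    U = np.asarray(U, dtype=float)
    m, n = U.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                            # maximise v  <=>  minimise -v
    A_ub = np.hstack([-U.T, np.ones((n, 1))])               # v - sum_i x_i U[i][j] <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Rock-Paper-Scissors: value 0, uniform maximin policy
value, policy = minimax_value([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(round(value, 3), np.round(policy, 3))                 # -> 0.0 [0.333 0.333 0.333]
```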

105 Minimax/Nash/Correlated Q-Learning Minimax-Q, Nash-Q, CE-Q can converge to equilibrium in self-play E.g. Nash-Q: formal proof of convergence to NE But based on strong restrictions on Q^t_j! (Hu and Wellman, 2003): Assumption 3. One of the following conditions holds during learning. Condition A. Every stage game (Q^t_1(s),..., Q^t_n(s)), for all t and s, has a global optimal point, and agents' payoffs in this equilibrium are used to update their Q-functions. Condition B. Every stage game (Q^t_1(s),..., Q^t_n(s)), for all t and s, has a saddle point, and agents' payoffs in this equilibrium are used to update their Q-functions. We further define the distance between two Q-functions. Definition 15. For Q, Q̂ ∈ Q, define ||Q - Q̂|| ≡ max_j max_s ||Q_j(s) - Q̂_j(s)|| ≡ max_j max_s max_{a_1,...,a_n} |Q_j(s, a_1,..., a_n) - Q̂_j(s, a_1,..., a_n)|. S. Albrecht, P. Stone 53

106 Model-Free Learning JAL, CJAL, Nash-Q,... learn models of other agents Model-based learning S. Albrecht, P. Stone 54

107 Model-Free Learning JAL, CJAL, Nash-Q,... learn models of other agents Model-based learning Can also learn without modelling other agents Model-free learning e.g. WoLF-PHC, Regret Matching S. Albrecht, P. Stone 54

108 Win or Learn Fast Policy Hill Climbing Win or Learn Fast Policy Hill Climbing (WoLF-PHC) (Bowling and Veloso, 2002) uses policy hill climbing in policy space: π̂^{t+1}_i(s^t, a_i) = π̂^t_i(s^t, a_i) + { δ if a_i = arg max_{a'_i} Q(s^t, a'_i); -δ/(|A_i| - 1) else } Q is standard Q-learning S. Albrecht, P. Stone 55

109 Win or Learn Fast Policy Hill Climbing Win or Learn Fast Policy Hill Climbing (WoLF-PHC) (Bowling and Veloso, 2002) uses policy hill climbing in policy space: π̂^{t+1}_i(s^t, a_i) = π̂^t_i(s^t, a_i) + { δ if a_i = arg max_{a'_i} Q(s^t, a'_i); -δ/(|A_i| - 1) else } Q is standard Q-learning Variable learning rate δ: δ = { δ_w if Σ_{a_i} π̂^t_i(s^t, a_i) Q(s^t, a_i) > Σ_{a_i} π̄_i(s^t, a_i) Q(s^t, a_i); δ_l else } adapt slowly when winning, fast when losing (δ_w < δ_l) π̄_i is average policy over past policies π̂_i S. Albrecht, P. Stone 55

110 Win or Learn Fast Policy Hill Climbing WoLF gradient ascent in self-play converges to Nash equilibrium in two-player, two-action repeated game (Bowling and Veloso, 2002) [Plots: policy trajectories of PHC and WoLF-PHC in self-play on the matching pennies game (Pr(Heads)) and the rock-paper-scissors game (Pr(Rock) vs Pr(Paper))] Targeted optimality: if opponent policy converges, WoLF-PHC converges to best-response against opponent S. Albrecht, P. Stone 56
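
A single WoLF-PHC policy-update step, sketched for one state (illustrative; assumes at least two actions, and uses a crude clip-and-renormalise step to keep the policy a valid distribution):

```python
def wolf_phc_update(policy, avg_policy, Q, delta_w, delta_l):
    """One WoLF-PHC policy update for a single state (sketch).

    policy, avg_policy: dicts action -> probability; Q: dict action -> value.
    Uses the slower rate delta_w when 'winning', the faster delta_l otherwise.
    """
    winning = (sum(policy[a] * Q[a] for a in policy)
               > sum(avg_policy[a] * Q[a] for a in policy))
    delta = delta_w if winning else delta_l
    best = max(Q, key=Q.get)
    for a in policy:                                   # move probability mass towards the greedy action
        policy[a] += delta if a == best else -delta / (len(policy) - 1)
        policy[a] = min(1.0, max(0.0, policy[a]))      # clip to [0, 1] (crude projection)
    z = sum(policy.values())
    for a in policy:
        policy[a] /= z                                 # renormalise
    return policy
```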

111 Regret Matching Regret Matching (RegMat) (Hart and Mas-Colell, 2000) computes conditional regret for not choosing a'_i whenever a_i was chosen: R(a_i, a'_i) = (1/(t+1)) Σ_{τ: a^τ_i = a_i} [ u_i(a'_i, a^τ_j) - u_i(a^τ) ] S. Albrecht, P. Stone 57

112 Regret Matching Regret Matching (RegMat) (Hart and Mas-Colell, 2000) computes conditional regret for not choosing a'_i whenever a_i was chosen: R(a_i, a'_i) = (1/(t+1)) Σ_{τ: a^τ_i = a_i} [ u_i(a'_i, a^τ_j) - u_i(a^τ) ] Used to modify policy: π̂^{t+1}_i(a_i) = (1/µ) max[R(a^t_i, a_i), 0] if a_i ≠ a^t_i, and π̂^{t+1}_i(a^t_i) = 1 - Σ_{a_i ≠ a^t_i} π̂^{t+1}_i(a_i) µ > 0 is inertia parameter S. Albrecht, P. Stone 57
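
A sketch of the regret-matching policy update (illustrative), assuming the conditional regrets R(a^t_i, ·) have already been computed and that µ is large enough for the result to be a probability distribution:

```python
def regret_matching_policy(last_action, actions, cond_regret, mu):
    """Regret-matching policy update in the style of Hart and Mas-Colell (sketch).

    cond_regret[(last_action, a)] is the conditional regret R(a^t_i, a);
    mu is the inertia parameter.
    """
    policy = {}
    for a in actions:
        if a != last_action:
            policy[a] = max(cond_regret[(last_action, a)], 0.0) / mu
    policy[last_action] = 1.0 - sum(policy.values())   # remaining mass stays on the last action
    return policy

regrets = {("C", "D"): 2.0, ("C", "C"): 0.0}
print(regret_matching_policy("C", ["C", "D"], regrets, mu=4.0))   # -> {'D': 0.5, 'C': 0.5}
```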

113 Regret Matching RegMat converges to correlated equilibrium in self-play Assumes actions commonly observed and utility functions known S. Albrecht, P. Stone 58

114 Regret Matching RegMat converges to correlated equilibrium in self-play Assumes actions commonly observed and utility functions known Modified RegMat (Hart and Mas-Colell, 2001) removes assumptions: only observe own action and utilities R(a_i, a'_i) = (1/(t+1)) Σ_{τ: a^τ_i = a'_i} [π̂^τ_i(a_i) / π̂^τ_i(a'_i)] u^τ_i - (1/(t+1)) Σ_{τ: a^τ_i = a_i} u^τ_i (plus modified policy normalisation) S. Albrecht, P. Stone 58

115 Regret Matching RegMat converges to correlated equilibrium in self-play Assumes actions commonly observed and utility functions known Modified RegMat (Hart and Mas-Colell, 2001) removes assumptions: only observe own action and utilities R(a_i, a'_i) = (1/(t+1)) Σ_{τ: a^τ_i = a'_i} [π̂^τ_i(a_i) / π̂^τ_i(a'_i)] u^τ_i - (1/(t+1)) Σ_{τ: a^τ_i = a_i} u^τ_i (plus modified policy normalisation) Also converges to correlated equilibrium in self-play! S. Albrecht, P. Stone 58

116 Learning in Mixed Groups Bonus question: How do algorithms perform in mixed groups? Empirical study by Albrecht and Ramamoorthy (2012): Tested 5 algorithms in mixed groups: JAL, CJAL, Nash-Q, WoLF-PHC, Modified RegMat S. Albrecht, P. Stone 59

117 Learning in Mixed Groups Bonus question: How do algorithms perform in mixed groups? Empirical study by Albrecht and Ramamoorthy (2012): Tested 5 algorithms in mixed groups: JAL, CJAL, Nash-Q, WoLF-PHC, Modified RegMat Tested in all (78) structurally distinct, strictly ordinal 2 × 2 repeated games (Rapoport and Guyer, 1966), e.g. utilities (1,2) (2,4) / (4,1) (3,3) S. Albrecht, P. Stone 59

118 Learning in Mixed Groups Bonus question: How do algorithms perform in mixed groups? Empirical study by Albrecht and Ramamoorthy (2012): Tested 5 algorithms in mixed groups: JAL, CJAL, Nash-Q, WoLF-PHC, Modified RegMat Tested in all (78) structurally distinct, strictly ordinal 2 × 2 repeated games (Rapoport and Guyer, 1966), e.g. utilities (1,2) (2,4) / (4,1) (3,3) Also tested in 500 random strictly ordinal (3 agents) repeated games S. Albrecht, P. Stone 59

119 Learning in Mixed Groups Test criteria: Convergence rate Final expected utilities Social welfare/fairness Solution rates: Nash equilibrium (NE) Pareto-optimality (PO) Welfare-optimality (WO) Fairness-optimality (FO) S. Albrecht, P. Stone 60

120 Learning in Mixed Groups Test criteria: Convergence rate Final expected utilities Social welfare/fairness Solution rates: Nash equilibrium (NE) Pareto-optimality (PO) Welfare-optimality (WO) Fairness-optimality (FO) Which algorithm is best? S. Albrecht, P. Stone 60

121 Learning in Mixed Groups Overall results [Bar chart: JAL, CJAL, WoLF-PHC, RegMat, Nash-Q compared on Conv., Exp. Util., NE, PO, WO, FO; vertical axis from 70% to 100%] 100% is maximum possible utility/rate Difficulty: NE < PO < WO < FO S. Albrecht, P. Stone 61

122 Learning in Mixed Groups Overall results [Bar chart: JAL, CJAL, WoLF-PHC, RegMat, Nash-Q compared on Conv., Exp. Util., NE, PO, WO, FO; vertical axis from 70% to 100%] Answer: No clear winner! S. Albrecht, P. Stone 61

123 Overview Introduction Multiagent Models & Assumptions Learning Goals Learning Algorithms Recent Trends S. Albrecht, P. Stone 62

124 Teamwork Typical approach: Whole team designed and trained by single organisation Agents share coordination protocols, communication languages, domain knowledge, algorithms,... S. Albrecht, P. Stone 63

125 Teamwork Typical approach: Whole team designed and trained by single organisation Agents share coordination protocols, communication languages, domain knowledge, algorithms,... Pre-coordination! S. Albrecht, P. Stone 63

126 Ad Hoc Teamwork What if pre-coordination not possible? S. Albrecht, P. Stone 64

127 Ad Hoc Teamwork What if pre-coordination not possible? Forming temporary teams on the fly S. Albrecht, P. Stone 64

128 Ad Hoc Teamwork What if pre-coordination not possible? Forming temporary teams on the fly Agents designed by different organisations S. Albrecht, P. Stone 64

129 Ad Hoc Teamwork What if pre-coordination not possible? Forming temporary teams on the fly Agents designed by different organisations Don t speak same language, no knowledge of other agents capabilities, different beliefs,... S. Albrecht, P. Stone 64

130 Ad Hoc Teamwork What if pre-coordination not possible? Forming temporary teams on the fly Agents designed by different organisations Don t speak same language, no knowledge of other agents capabilities, different beliefs,... Challenge: Ad Hoc Teamwork (Stone et al., 2010) Create an autonomous agent that is able to efficiently and robustly collaborate with previously unknown teammates on tasks to which they are all individually capable of contributing as team members. S. Albrecht, P. Stone 64

131 RoboCup Drop-In Competition RoboCup SPL Drop-In Competition 13, 14, 15 (Genter et al., 2017) Mixed players from different teams No prior coordination between players Video: Drop-In Competition S. Albrecht, P. Stone 65

132 Learning in Ad Hoc Teamwork Many learning algorithms not suitable for ad hoc teamwork: S. Albrecht, P. Stone 66

133 Learning in Ad Hoc Teamwork Many learning algorithms not suitable for ad hoc teamwork: RL-based algorithms (JAL, CJAL, Nash-Q,...) need 1000s of iterations in simple games Ad hoc teamwork: not much time for learning, trial & error,... S. Albrecht, P. Stone 66

134 Learning in Ad Hoc Teamwork Many learning algorithms not suitable for ad hoc teamwork: RL-based algorithms (JAL, CJAL, Nash-Q,...) need 1000s of iterations in simple games Ad hoc teamwork: not much time for learning, trial & error,... Many algorithms designed for self-play (all agents use same algorithm) Ad hoc teamwork: no control over other agents S. Albrecht, P. Stone 66

135 Learning in Ad Hoc Teamwork Many learning algorithms not suitable for ad hoc teamwork: RL-based algorithms (JAL, CJAL, Nash-Q,...) need 1000s of iterations in simple games Ad hoc teamwork: not much time for learning, trial & error,... Many algorithms designed for self-play (all agents use same algorithm) Ad hoc teamwork: no control over other agents Need method which can learn quickly to interact effectively with unknown other agents! S. Albrecht, P. Stone 66

136 Type-Based Method Hypothesise possible types of other agents: Each type θ_j is blackbox behaviour specification: P(a_j | H^t, θ_j) S. Albrecht, P. Stone 67

137 Type-Based Method Hypothesise possible types of other agents: Each type θ_j is blackbox behaviour specification: P(a_j | H^t, θ_j) Generate types from e.g. experience from past interactions domain and task knowledge learn new types online (opponent modelling) S. Albrecht, P. Stone 67

138 Type-Based Method During the interaction: Compute belief over types based on interaction history H^t: P(θ_j | H^t) ∝ P(H^t | θ_j) P(θ_j) Plan own action based on beliefs S. Albrecht, P. Stone 68

139 Type-Based Method Planning Harsanyi-Bellman Ad Hoc Coordination (HBA) (Albrecht et al., 2016): π_i(H^t, a_i) > 0 ⇒ a_i ∈ arg max_{a'_i} E^{a'_i}_{s^t}(H^t) E^{a_i}_s(Ĥ) = Σ_{θ_j} P(θ_j | Ĥ) Σ_{a_j} P(a_j | Ĥ, θ_j) Q^{(a_i, a_j)}_s(Ĥ) Q^a_s(Ĥ) = Σ_{s'} T(s, a, s') [ u_i(s, a) + γ max_{a_i} E^{a_i}_{s'}(⟨Ĥ, a, s'⟩) ] S. Albrecht, P. Stone 69

140 Type-Based Method Planning Harsanyi-Bellman Ad Hoc Coordination (HBA) (Albrecht et al., 2016): π_i(H^t, a_i) > 0 ⇒ a_i ∈ arg max_{a'_i} E^{a'_i}_{s^t}(H^t) E^{a_i}_s(Ĥ) = Σ_{θ_j} P(θ_j | Ĥ) Σ_{a_j} P(a_j | Ĥ, θ_j) Q^{(a_i, a_j)}_s(Ĥ) Q^a_s(Ĥ) = Σ_{s'} T(s, a, s') [ u_i(s, a) + γ max_{a_i} E^{a_i}_{s'}(⟨Ĥ, a, s'⟩) ] Optimal planning with built-in exploration: Value of Information S. Albrecht, P. Stone 69

141 Type-Based Method Planning Can compute E a i s with finite tree-expansion: Unfold tree of future trajectories with fixed depth Associate each trajectory with probability and utility Calculate expected utility of action by traversing to root S. Albrecht, P. Stone 70

142 Type-Based Method Planning Can compute E a i s with finite tree-expansion: Unfold tree of future trajectories with fixed depth Associate each trajectory with probability and utility Calculate expected utility of action by traversing to root Inefficient: exponential in states, actions, agents S. Albrecht, P. Stone 70

143 Type-Based Method Planning Use Monte-Carlo Tree Search (MCTS) for efficient approximation: Repeat x times: 1. Sample type θ j Θ j with probabilities P(θ j H t ) 2. Sample interaction trajectory using θ j and domain model T 3. Update utility estimates via backprop on trajectory E.g. Albrecht and Stone (2017), Barrett et al. (2011) S. Albrecht, P. Stone 71

144 Type-Based Method Planning Use Monte-Carlo Tree Search (MCTS) for efficient approximation: Repeat x times: 1. Sample type θ j Θ j with probabilities P(θ j H t ) 2. Sample interaction trajectory using θ j and domain model T 3. Update utility estimates via backprop on trajectory E.g. Albrecht and Stone (2017), Barrett et al. (2011) But: loses value of information! (no belief change during planning) S. Albrecht, P. Stone 71

145 Ad Hoc Teamwork: Predator Pursuit 4 predators must capture 1 prey in grid world (Barrett et al., 2011) We control one agent in predator team Policies of other predators unknown (prey moves randomly) 4 types provided to our agent; online planning using MCTS Video: 4 types, true type inside Video: 4 types, true type outside (students) Source: S. Albrecht, P. Stone 72

146 Ad Hoc Teamwork: Half Field Offense 4 offense players vs. 5 defense players (Barrett and Stone, 2015) We control one agent (green) in offensive team (yellow) Policies of teammates unknown (defense uses fixed policies) 7 team types provided to our agent; for each team type, plan own policy offline using RL Video: 4v5 Half Field Offense Source: S. Albrecht, P. Stone 73

147 Learning Parameters in Types We can learn more: parameters in types! (Albrecht and Stone, 2017) P(a_j | H^t, θ_j, p) p = (p_1,..., p_k) continuous parameter vector Complex types can have several parameters learning rate, exploration rate, discount factor,... S. Albrecht, P. Stone 74

148 Learning Parameters in Types We can learn more: parameters in types! (Albrecht and Stone, 2017) P(a_j | H^t, θ_j, p) p = (p_1,..., p_k) continuous parameter vector Complex types can have several parameters learning rate, exploration rate, discount factor,... Goal: simultaneously learn type and parameters in type S. Albrecht, P. Stone 74

149 Learning Parameters in Types For each type θ_j ∈ Θ_j, maintain parameter estimate p ∈ [p_min, p_max] Observe action a^t_j of agent j Select types Φ ⊆ Θ_j for updating For each θ_j ∈ Φ, update estimate p^t → p^{t+1} Plan own action Update beliefs: P(θ_j | H^{t+1}) ∝ P(a^t_j | H^t, θ_j, p^{t+1}) P(θ_j | H^t) S. Albrecht, P. Stone 75

150 Updating Parameter Estimates Given type θ_j, update parameter estimate p^t → p^{t+1} Type defines action likelihoods P(a^t_j | H^t, θ_j, p) [Figure: stacked likelihood surfaces P(a^0_j | H^0, θ_j, p_1, p_2), P(a^1_j | H^1, θ_j, p_1, p_2), P(a^2_j | H^2, θ_j, p_1, p_2) over the parameter space] S. Albrecht, P. Stone 76

151 Updating Parameter Estimates Bayesian updating: Approximate P(a^t_j | H^t, θ_j, p) as polynomial with variables p Perform conjugate updates through successive layers [Figure: stacked likelihood surfaces over the parameter space] S. Albrecht, P. Stone 77

152 Updating Parameter Estimates Bayesian updating: Approximate P(a^t_j | H^t, θ_j, p) as polynomial with variables p Perform conjugate updates through successive layers Global optimisation: arg max_p Π_{τ=1}^{t+1} P(a^{τ-1}_j | H^{τ-1}, θ_j, p) Solve with Bayesian optimisation (Albrecht and Stone, 2017) [Figure: stacked likelihood surfaces over the parameter space] S. Albrecht, P. Stone 77

153 Ad Hoc Teamwork: Level-Based Foraging Blue = our agent, red = other agents Goal: collect all items in minimal time Agents can collect item if sum of agent levels ≥ item level S. Albrecht, P. Stone 78

154 Ad Hoc Teamwork: Level-Based Foraging Blue = our agent, red = other agents Goal: collect all items in minimal time Agents can collect item if sum of agent levels ≥ item level Possible types for red, e.g. search for item, try to load search for agent, load item closest to agent S. Albrecht, P. Stone 78

155 Ad Hoc Teamwork: Level-Based Foraging Blue = our agent, red = other agents Goal: collect all items in minimal time Agents can collect item if sum of agent levels ≥ item level Possible types for red, e.g. search for item, try to load search for agent, load item closest to agent Each type uses 3 parameters: skill level, view radius, view angle S. Albrecht, P. Stone 78

156 Ad Hoc Teamwork: Level-Based Foraging Blue = our agent, red = other agents Goal: collect all items in minimal time Agents can collect item if sum of agent levels ≥ item level Possible types for red, e.g. search for item, try to load search for agent, load item closest to agent Each type uses 3 parameters: skill level, view radius, view angle Blue doesn't know true type of red nor parameter values of type S. Albrecht, P. Stone 78

157 Ad Hoc Teamwork: Level-Based Foraging Video: 15x15 world, 3 agents Video: 10x10 world, 2 agents S. Albrecht, P. Stone 79

158 Type-Based Method & Ad Hoc Teamwork AAAI 16 Tutorial on Type-Based Methods: Special Issue on Multiagent Interaction without Prior Coordination (MIPC): MIPC Workshops: AAMAS 17, Sao Paulo, Brazil AAAI 16, Phoenix, Arizona, USA AAAI 15, Austin, Texas, USA AAAI 14, Quebec City, Canada S. Albrecht, P. Stone 80

159 Deep Reinforcement Learning Standard Q-learning assumes tabular representation: One entry in Q for each (s, a) S. Albrecht, P. Stone 81

160 Deep Reinforcement Learning Standard Q-learning assumes tabular representation: One entry in Q for each (s, a) Does not scale to complex domains! S. Albrecht, P. Stone 81

161 Deep Reinforcement Learning Standard Q-learning assumes tabular representation: One entry in Q for each (s, a) Does not scale to complex domains! Does not generalise values! S. Albrecht, P. Stone 81

162 Deep Reinforcement Learning Standard Q-learning assumes tabular representation: One entry in Q for each (s, a) Does not scale to complex domains! Does not generalise values! Needs extra engineering to work, including: State abstraction to reduce state space (usually hand-coded & domain-specific) Function approximation to store and generalise Q (e.g. linear function approximation in state features) S. Albrecht, P. Stone 81

163 Deep Reinforcement Learning New problem: extra engineering may limit performance! State abstraction may be wrong (e.g. too coarse) Function approximator may be inaccurate S. Albrecht, P. Stone 82

164 Deep Reinforcement Learning New problem: extra engineering may limit performance! State abstraction may be wrong (e.g. too coarse) Function approximator may be inaccurate Idea: deep reinforcement learning Use deep neural network to represent Q Learn on raw data (no state abstraction) Let network learn good abstraction on its own! S. Albrecht, P. Stone 82

165 Deep Reinforcement Learning Deep learning: neural network with many layers Input layer takes raw data s Hidden layers transform data Output layer returns target scalars Q(s, ·) Train network with back-propagation on labelled data S. Albrecht, P. Stone 83

166 Deep Q-Learning Deep Q-Learning (Mnih et al., 2013) Initialise network parameters Ψ with random weights S. Albrecht, P. Stone 84

167 Deep Q-Learning Deep Q-Learning (Mnih et al., 2013) Initialise network parameters Ψ with random weights 1. Observe current state s t S. Albrecht, P. Stone 84

168 Deep Q-Learning Deep Q-Learning (Mnih et al., 2013) Initialise network parameters Ψ with random weights 1. Observe current state s t 2. With probability ϵ, select random action a t Else, select action a t arg max a Q(s t, a; Ψ) S. Albrecht, P. Stone 84

169 Deep Q-Learning Deep Q-Learning (Mnih et al., 2013) Initialise network parameters Ψ with random weights 1. Observe current state s t 2. With probability ϵ, select random action a t Else, select action a t arg max a Q(s t, a; Ψ) 3. Get reward r t and new state s t+1 S. Albrecht, P. Stone 84

170 Deep Q-Learning Deep Q-Learning (Mnih et al., 2013) Initialise network parameters Ψ with random weights 1. Observe current state s t 2. With probability ϵ, select random action a t Else, select action a t arg max a Q(s t, a; Ψ) 3. Get reward r t and new state s t+1 4. Store experience (s t, a t, r t, s t+1 ) in D S. Albrecht, P. Stone 84

171 Deep Q-Learning Deep Q-Learning (Mnih et al., 2013) Initialise network parameters Ψ with random weights 1. Observe current state s^t 2. With probability ϵ, select random action a^t Else, select action a^t ∈ arg max_a Q(s^t, a; Ψ) 3. Get reward r^t and new state s^{t+1} 4. Store experience (s^t, a^t, r^t, s^{t+1}) in D 5. Sample random minibatch D^+ ⊆ D S. Albrecht, P. Stone 84

172 Deep Q-Learning Deep Q-Learning (Mnih et al., 2013) Initialise network parameters Ψ with random weights 1. Observe current state s^t 2. With probability ϵ, select random action a^t Else, select action a^t ∈ arg max_a Q(s^t, a; Ψ) 3. Get reward r^t and new state s^{t+1} 4. Store experience (s^t, a^t, r^t, s^{t+1}) in D 5. Sample random minibatch D^+ ⊆ D 6. For each (s^τ, a^τ, r^τ, s^{τ+1}) ∈ D^+, perform gradient descent step on (y^τ - Q(s^τ, a^τ; Ψ))^2 with y^τ = r^τ + γ max_{a'} Q(s^{τ+1}, a'; Ψ_fixed) S. Albrecht, P. Stone 84
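
A minimal Deep Q-Learning sketch in PyTorch (illustrative only: sizes and hyperparameters are placeholders, terminal-state handling is omitted, and the replay buffer must be filled by an environment loop not shown here):

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma, eps = 4, 2, 0.99, 0.1
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())          # Psi_fixed: periodically synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                            # experience buffer D

def select_action(state):
    if random.random() < eps:
        return random.randrange(n_actions)               # epsilon-greedy exploration
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)            # minibatch D+
    s, a, r, s2 = map(lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = r + gamma * target_net(s2).max(1).values     # y_tau with fixed target network
    loss = nn.functional.mse_loss(q, y)                  # (y - Q(s, a; Psi))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```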

173 Multiagent Deep Reinforcement Learning Deep RL very successful at many single-agent games e.g. Atari games, Go, 3D maze navigation,... Can we use Deep RL for multiagent learning? Problem: learning of other agents makes environment non-stationary (breaks Markov property) S. Albrecht, P. Stone 85

174 Independent Deep Q-Learners Video: Cooperative Pong Video: Competitive Pong (Tampuu et al., 2017) Video: Gathering game Video: Wolfpack game (Leibo et al., 2017) Video: Starcraft (Foerster et al., 2017) S. Albrecht, P. Stone 86


More information

First Prev Next Last Go Back Full Screen Close Quit. Game Theory. Giorgio Fagiolo

First Prev Next Last Go Back Full Screen Close Quit. Game Theory. Giorgio Fagiolo Game Theory Giorgio Fagiolo giorgio.fagiolo@univr.it https://mail.sssup.it/ fagiolo/welcome.html Academic Year 2005-2006 University of Verona Summary 1. Why Game Theory? 2. Cooperative vs. Noncooperative

More information

Introduction to Reinforcement Learning

Introduction to Reinforcement Learning CSCI-699: Advanced Topics in Deep Learning 01/16/2019 Nitin Kamra Spring 2019 Introduction to Reinforcement Learning 1 What is Reinforcement Learning? So far we have seen unsupervised and supervised learning.

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018

INF 5860 Machine learning for image classification. Lecture 14: Reinforcement learning May 9, 2018 Machine learning for image classification Lecture 14: Reinforcement learning May 9, 2018 Page 3 Outline Motivation Introduction to reinforcement learning (RL) Value function based methods (Q-learning)

More information

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be

Prof. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Incremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests

Incremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests Incremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests Nancy Fulda and Dan Ventura Department of Computer Science Brigham Young University

More information

Cyclic Equilibria in Markov Games

Cyclic Equilibria in Markov Games Cyclic Equilibria in Markov Games Martin Zinkevich and Amy Greenwald Department of Computer Science Brown University Providence, RI 02912 {maz,amy}@cs.brown.edu Michael L. Littman Department of Computer

More information

Best-Response Multiagent Learning in Non-Stationary Environments

Best-Response Multiagent Learning in Non-Stationary Environments Best-Response Multiagent Learning in Non-Stationary Environments Michael Weinberg Jeffrey S. Rosenschein School of Engineering and Computer Science Heew University Jerusalem, Israel fmwmw,jeffg@cs.huji.ac.il

More information

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies

Reinforcement Learning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Reinforcement earning in Partially Observable Multiagent Settings: Monte Carlo Exploring Policies Presenter: Roi Ceren THINC ab, University of Georgia roi@ceren.net Prashant Doshi THINC ab, University

More information

A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics

A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics Journal of Artificial Intelligence Research 33 (28) 52-549 Submitted 6/8; published 2/8 A Multiagent Reinforcement Learning Algorithm with Non-linear Dynamics Sherief Abdallah Faculty of Informatics The

More information

Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing

Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing Jacob W. Crandall and Michael A. Goodrich Computer Science Department Brigham Young University Provo, UT 84602

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

CSC321 Lecture 22: Q-Learning

CSC321 Lecture 22: Q-Learning CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize

More information

David Silver, Google DeepMind

David Silver, Google DeepMind Tutorial: Deep Reinforcement Learning David Silver, Google DeepMind Outline Introduction to Deep Learning Introduction to Reinforcement Learning Value-Based Deep RL Policy-Based Deep RL Model-Based Deep

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@csa.iisc.ernet.in Department of Computer Science and Automation Indian Institute of Science August 2014 What is Reinforcement

More information

Satisfaction Equilibrium: Achieving Cooperation in Incomplete Information Games

Satisfaction Equilibrium: Achieving Cooperation in Incomplete Information Games Satisfaction Equilibrium: Achieving Cooperation in Incomplete Information Games Stéphane Ross and Brahim Chaib-draa Department of Computer Science and Software Engineering Laval University, Québec (Qc),

More information

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve

More information

Machine Learning I Reinforcement Learning

Machine Learning I Reinforcement Learning Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:

More information

QUICR-learning for Multi-Agent Coordination

QUICR-learning for Multi-Agent Coordination QUICR-learning for Multi-Agent Coordination Adrian K. Agogino UCSC, NASA Ames Research Center Mailstop 269-3 Moffett Field, CA 94035 adrian@email.arc.nasa.gov Kagan Tumer NASA Ames Research Center Mailstop

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement

More information

Online Learning Class 12, 20 March 2006 Andrea Caponnetto, Sanmay Das

Online Learning Class 12, 20 March 2006 Andrea Caponnetto, Sanmay Das Online Learning 9.520 Class 12, 20 March 2006 Andrea Caponnetto, Sanmay Das About this class Goal To introduce the general setting of online learning. To describe an online version of the RLS algorithm

More information

AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents

AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents Vincent Conitzer conitzer@cs.cmu.edu Computer Science Department, Carnegie

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Introduction to Game Theory

Introduction to Game Theory COMP323 Introduction to Computational Game Theory Introduction to Game Theory Paul G. Spirakis Department of Computer Science University of Liverpool Paul G. Spirakis (U. Liverpool) Introduction to Game

More information

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti

MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti 1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early

More information

Learning Equilibrium as a Generalization of Learning to Optimize

Learning Equilibrium as a Generalization of Learning to Optimize Learning Equilibrium as a Generalization of Learning to Optimize Dov Monderer and Moshe Tennenholtz Faculty of Industrial Engineering and Management Technion Israel Institute of Technology Haifa 32000,

More information

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017

Deep Reinforcement Learning. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Deep Reinforcement Learning STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 19, 2017 Outline Introduction to Reinforcement Learning AlphaGo (Deep RL for Computer Go)

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Solving Zero-Sum Extensive-Form Games. Branislav Bošanský AE4M36MAS, Fall 2013, Lecture 6

Solving Zero-Sum Extensive-Form Games. Branislav Bošanský AE4M36MAS, Fall 2013, Lecture 6 Solving Zero-Sum Extensive-Form Games ranislav ošanský E4M36MS, Fall 2013, Lecture 6 Imperfect Information EFGs States Players 1 2 Information Set ctions Utility Solving II Zero-Sum EFG with perfect recall

More information

Gradient descent for symmetric and asymmetric multiagent reinforcement learning

Gradient descent for symmetric and asymmetric multiagent reinforcement learning Web Intelligence and Agent Systems: An international journal 3 (25) 17 3 17 IOS Press Gradient descent for symmetric and asymmetric multiagent reinforcement learning Ville Könönen Neural Networks Research

More information

Basic Game Theory. Kate Larson. January 7, University of Waterloo. Kate Larson. What is Game Theory? Normal Form Games. Computing Equilibria

Basic Game Theory. Kate Larson. January 7, University of Waterloo. Kate Larson. What is Game Theory? Normal Form Games. Computing Equilibria Basic Game Theory University of Waterloo January 7, 2013 Outline 1 2 3 What is game theory? The study of games! Bluffing in poker What move to make in chess How to play Rock-Scissors-Paper Also study of

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

Reinforcement Learning Part 2

Reinforcement Learning Part 2 Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

CSC304 Lecture 5. Game Theory : Zero-Sum Games, The Minimax Theorem. CSC304 - Nisarg Shah 1

CSC304 Lecture 5. Game Theory : Zero-Sum Games, The Minimax Theorem. CSC304 - Nisarg Shah 1 CSC304 Lecture 5 Game Theory : Zero-Sum Games, The Minimax Theorem CSC304 - Nisarg Shah 1 Recap Last lecture Cost-sharing games o Price of anarchy (PoA) can be n o Price of stability (PoS) is O(log n)

More information

Deep Reinforcement Learning

Deep Reinforcement Learning Martin Matyášek Artificial Intelligence Center Czech Technical University in Prague October 27, 2016 Martin Matyášek VPD, 2016 1 / 50 Reinforcement Learning in a picture R. S. Sutton and A. G. Barto 2015

More information

Learning To Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining

Learning To Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining Learning To Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining Jeffrey L. Stimpson & ichael A. Goodrich Computer Science Department, Brigham Young University, Provo, UT 84602 jstim,mike@cs.byu.edu

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

IEEE TRANSACTIONS ON CYBERNETICS, VOL. 45, NO. 7, JULY Yujing Hu, Yang Gao, Member, IEEE, andboan,member, IEEE

IEEE TRANSACTIONS ON CYBERNETICS, VOL. 45, NO. 7, JULY Yujing Hu, Yang Gao, Member, IEEE, andboan,member, IEEE IEEE TRANSACTIONS ON CYBERNETICS, VOL. 45, NO. 7, JULY 2015 1289 Accelerating Multiagent Reinforcement Learning by Equilibrium Transfer Yujing Hu, Yang Gao, Member, IEEE, andboan,member, IEEE Abstract

More information

Reinforcement Learning. Machine Learning, Fall 2010

Reinforcement Learning. Machine Learning, Fall 2010 Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

ilstd: Eligibility Traces and Convergence Analysis

ilstd: Eligibility Traces and Convergence Analysis ilstd: Eligibility Traces and Convergence Analysis Alborz Geramifard Michael Bowling Martin Zinkevich Richard S. Sutton Department of Computing Science University of Alberta Edmonton, Alberta {alborz,bowling,maz,sutton}@cs.ualberta.ca

More information

Quantum Games. Quantum Strategies in Classical Games. Presented by Yaniv Carmeli

Quantum Games. Quantum Strategies in Classical Games. Presented by Yaniv Carmeli Quantum Games Quantum Strategies in Classical Games Presented by Yaniv Carmeli 1 Talk Outline Introduction Game Theory Why quantum games? PQ Games PQ penny flip 2x2 Games Quantum strategies 2 Game Theory

More information

Notes on Coursera s Game Theory

Notes on Coursera s Game Theory Notes on Coursera s Game Theory Manoel Horta Ribeiro Week 01: Introduction and Overview Game theory is about self interested agents interacting within a specific set of rules. Self-Interested Agents have

More information

Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

More information

Chapter 3: The Reinforcement Learning Problem

Chapter 3: The Reinforcement Learning Problem Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information

Open Theoretical Questions in Reinforcement Learning

Open Theoretical Questions in Reinforcement Learning Open Theoretical Questions in Reinforcement Learning Richard S. Sutton AT&T Labs, Florham Park, NJ 07932, USA, sutton@research.att.com, www.cs.umass.edu/~rich Reinforcement learning (RL) concerns the problem

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Leading a Best-Response Teammate in an Ad Hoc Team

Leading a Best-Response Teammate in an Ad Hoc Team In AAMAS 2009 workshop on Agent Mediated Electronic Commerce (AMEC 2009), May 2009. Leading a Best-Response Teammate in an Ad Hoc Team Peter Stone 1, Gal A. Kaminka 2, and Jeffrey S. Rosenschein 3 1 Dept.

More information

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of

More information

A (Brief) Introduction to Game Theory

A (Brief) Introduction to Game Theory A (Brief) Introduction to Game Theory Johanne Cohen PRiSM/CNRS, Versailles, France. Goal Goal is a Nash equilibrium. Today The game of Chicken Definitions Nash Equilibrium Rock-paper-scissors Game Mixed

More information

Some AI Planning Problems

Some AI Planning Problems Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL) 15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we

More information

Cooperating with a Markovian Ad Hoc Teammate

Cooperating with a Markovian Ad Hoc Teammate In the Proceedings of the 2th International Conference on Autonomous Agents and Multiagent Systems, Saint Paul, Minnesota, USA, 203. Cooperating with a Markovian Ad Hoc eammate Doran Chakraborty Microsoft

More information

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about:

Short Course: Multiagent Systems. Multiagent Systems. Lecture 1: Basics Agents Environments. Reinforcement Learning. This course is about: Short Course: Multiagent Systems Lecture 1: Basics Agents Environments Reinforcement Learning Multiagent Systems This course is about: Agents: Sensing, reasoning, acting Multiagent Systems: Distributed

More information

Learning an Effective Strategy in a Multi-Agent System with Hidden Information

Learning an Effective Strategy in a Multi-Agent System with Hidden Information Learning an Effective Strategy in a Multi-Agent System with Hidden Information Richard Mealing Supervisor: Jon Shapiro Machine Learning and Optimisation Group School of Computer Science University of Manchester

More information

Games A game is a tuple = (I; (S i ;r i ) i2i) where ffl I is a set of players (i 2 I) ffl S i is a set of (pure) strategies (s i 2 S i ) Q ffl r i :

Games A game is a tuple = (I; (S i ;r i ) i2i) where ffl I is a set of players (i 2 I) ffl S i is a set of (pure) strategies (s i 2 S i ) Q ffl r i : On the Connection between No-Regret Learning, Fictitious Play, & Nash Equilibrium Amy Greenwald Brown University Gunes Ercal, David Gondek, Amir Jafari July, Games A game is a tuple = (I; (S i ;r i ) i2i)

More information

Counterfactual Multi-Agent Policy Gradients

Counterfactual Multi-Agent Policy Gradients Counterfactual Multi-Agent Policy Gradients Shimon Whiteson Dept. of Computer Science University of Oxford joint work with Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, and Nantas Nardelli July

More information

Reinforcement Learning. Yishay Mansour Tel-Aviv University

Reinforcement Learning. Yishay Mansour Tel-Aviv University Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak

More information

Lecture 3: The Reinforcement Learning Problem

Lecture 3: The Reinforcement Learning Problem Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which

More information