Short Course: Multiagent Systems
Lecture 1: Basics: Agents, Environments, Reinforcement Learning

This course is about:
Agents: sensing, reasoning, acting
Multiagent systems: distributed control; dealing with systems with interactions
 o Cooperative: collectives, swarms
 o Competitive: game theory, auctions
Doing research
 o Literature review
 o Gap identification
 o Research
 o Results
Motivation: Why Study Multiagent Systems?

Current trends:
Systems are becoming more interconnected
 o Larger, more distributed, more stochastic
Hybrid systems are emerging
 o Biological/nano/electronic systems
Computation is entering new niches
 o More powerful, cheaper, smaller devices

We need new approaches to optimization and control for:
Thousands of components
Failing components
Dynamic and stochastic environments
Hierarchical/hybrid systems

Applications: Think Big and Small
Controlling multiple autonomous vehicles
Managing traffic congestion
Routing data over a network
Controlling constellations of satellites
Coordinating thousands of simple devices
Managing power distribution
Stabilizing wings with tiny flaps
Flying in formation
Morphing matter: smart structures
Coordinating micro air vehicles
Managing system health
Controlling nano/micro devices
Managing air traffic flow
Agents

Two definitions:
An agent is a computer system that is capable of independent (autonomous) action.
 o Autonomous: figures out what needs to be done
 o How to select actions
 o How to evaluate actions
An agent is a computer system that senses, reasons about, and acts in an environment.
 o Sensing
 o Reasoning
 o Acting
 o Environment

Intelligence
The complexity of the tasks that we automate has grown. A lot of what we take for granted today would have been viewed as AI ten years ago:
The autopilot of a 747? Deep Blue? Internet searches?
Intelligence seems like something off in the distance, but in reality it is in many everyday products.
My definition: an agent that senses the world, reasons about the world, acts within that world, and learns from its interaction with the world is an intelligent agent.
Multiagent Systems

Definition:
A multiagent system is a system that consists of multiple agents that interact with one another and with the environment.
 o Multiple
 o Agents
 o Interact
 o Environment

Objections to multiagent systems
Isn't it just:
Distributed systems? AI? Game theory? Economics/mechanism design? Social science? Biological/ecological modeling?
Environment

Properties of the environment:
 o Accessible vs. inaccessible
 o Deterministic vs. non-deterministic
 o Episodic vs. non-episodic
 o Static vs. dynamic
 o Discrete vs. continuous

Accessible vs. inaccessible
Accessible: agents can obtain complete, accurate, up-to-date information about the environment.
Most environments of interest are inaccessible.
 o For example, anything operating in the real world
Deterministic vs. non-deterministic
Deterministic: an action taken by an agent has a predictable consequence; there is no uncertainty about the outcome of an action.
Most environments of interest are non-deterministic.
 o For example, most things operating in the real world

Episodic vs. non-episodic
Episodic: the agent acts for a fixed number of time steps and then the world is reset; there is no link between episodes.
Non-episodic: the agent operates continuously in an environment.
 o Some real-world problems are episodic: games
 o Other real-world problems are non-episodic: exploring a terrain
 o Some problems can be viewed as either: a robot trying to find a target in a room
Static vs. dynamic
Static: the environment stays the same except for changes caused by the agent's actions.
 o A robot aims to detect fixed goals in an arena
Dynamic: the environment in which the agent operates changes on its own.
 o Goals that the robot needs to detect appear/disappear/move

Discrete vs. continuous
Discrete: there is a fixed, finite number of actions.
 o A game of chess is discrete (discrete does not mean easy)
Continuous: there is an infinite number of actions.
 o Autonomous vehicle control is continuous (in the most general case): direction angle, speed
Reactive Agents
A reactive agent is one that interacts with a changing environment and responds to changes that occur in that environment.
Fixed responses? Learning systems?

Goal-Directed Agents
We design agents to do something. That something has to be expressed to the agent:
Utility/objective/reward/payoff function
Goal-Directed Agents
A goal-directed agent is one that acts to achieve a specified goal.
Fixed responses? Learning systems?

Deductive/Inductive Reasoning
Deductive reasoning example:
 o Rover is a dog
 o All dogs are mammals
 o Therefore, Rover is a mammal
Inductive reasoning example:
 o Ellen is a student
 o Most students study
 o Ellen must study
Brooks and the Subsumption Architecture
Rodney Brooks makes three key statements:
 o Intelligent behavior can be generated without explicit representations
 o Intelligent behavior can be generated without explicit abstract reasoning
 o Intelligent behavior is an emergent property of certain types of complex systems
Two key ideas:
 o Situatedness and embodiment: real intelligence is situated in the world, not in disembodied systems such as theorem provers or expert systems
 o Intelligence and emergence: intelligent behavior arises as a result of an agent's interaction with its environment
(from Wooldridge, An Introduction to Multiagent Systems, chap. 5)

Example: Bar Problem
Congestion game: a game where agents share the same action space and the system objective is a function purely of how many agents take each action.
Illustrative example: Arthur's El Farol bar problem. At each time step, each agent decides whether to attend a bar:
 o If the agent attends and the bar is below capacity, the agent gets a reward
 o If the agent stays home and the bar is above capacity, the agent gets a reward
The problem is particularly interesting because rational agents cannot all correctly predict attendance:
 o If most agents predict attendance will be low and therefore attend, attendance will be high
 o If most agents predict attendance will be high and therefore stay home, attendance will be low
Example: Bar Problem
The bar owner has a utility function to maximize. For each day:
 o Maximal revenue if the bar is at capacity c
 o If more crowded than c, revenue drops off and extra costs kick in
 o If less crowded than c, not enough revenue
The bar owner wants to maximize total revenue for the week.
The problem belongs to the class of congestion games, where the value of a resource (a night at the bar) depends on the number of people using it:
 o Bar --> highway; night --> lane; barkeep --> city manager
 o Bar --> server farm; night --> server; barkeep --> IT support

Modified El Farol Bar Problem
Each week, each agent selects one of N nights to attend the bar. The system reward is

G(z) = Σ_{k=1}^{N} x_k(z) e^{-x_k(z)/c}

where x_k(z) is the attendance on night k, c is the capacity of the bar, each term x_k(z) e^{-x_k(z)/c} is the reward for night k, and G is the total reward over all nights.
Further modifications:
 o Each week each agent selects two nights to attend the bar
 o ...
 o Each week each agent selects six nights to attend the bar
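A minimal sketch of this reward, assuming the exponential per-night form x e^{-x/c} (the attendance profiles below are invented for illustration):

```python
import math

def night_reward(x, c):
    """Reward for one night with attendance x and bar capacity c.

    Peaks at x = c: too few patrons means little revenue,
    too many means congestion costs.
    """
    return x * math.exp(-x / c)

def world_utility(attendance, c):
    """G(z): total reward over all nights, given per-night attendance."""
    return sum(night_reward(x, c) for x in attendance)

# 5 nights, capacity 6: a spread-out week beats a crowded single night.
print(world_utility([6, 6, 6, 6, 6], c=6))   # every night at capacity
print(world_utility([0, 0, 30, 0, 0], c=6))  # everyone on one night
```

The same total attendance (30 agents) scores far higher when spread across the week, which is why the system objective rewards coordination rather than everyone chasing the same night.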
Properties of Agents
Input but no state (purely reactive):
 o No state information; decision based solely on the present input
 o Bar problem: go the same day as Joe and Jane
State but no input (purely history based):
 o Build internal state based on past observations; decision based on past successes, but no input
 o Bar problem: go the day I got the best reward over the last n weeks
State and input:
 o Build internal state based on past observations; use additional input to predict likely outcomes
 o Bar problem: go on the day that gave me the most reward and that best matches what Joe and Jane are doing this time step

Learning Agents
We focus on two types of agents:
Reinforcement learning agents
 o Learning automata
 o Action value
 o Value iteration
 o Policy iteration
 o Temporal difference learning
 o Q-learning
Neuro-control agents
 o A neural network maps inputs to outputs
 o Weights adapt to minimize an objective function
 o Needs a teacher, or a search algorithm (simulated annealing, evolutionary algorithms)
Reinforcement Learning Concept
Learn from interactions with the environment:
 o Take an action
 o Receive feedback from the environment
 o Modify your behavior
 o Achieve some goal
Examples:
 o A baby playing: connection to the environment guides sensory input/output
 o Driving a car: learning from interaction

Agent-Environment Interaction in RL
[Diagram: the agent receives state s_t and reward r_{t-1}, outputs action a_t, and the environment returns the next state s_{t+1} and reward r_t.]
A policy π_t maps states to probabilities of taking each action:

π_t(s, a) = P(a_t = a | s_t = s)

Learning Agents: RL
A simple reinforcement learner (action value) for an agent:
 o The agent has N actions
 o The agent has a value V_k associated with each action a_k
 o At each time step, the agent takes action a_k (for the nth time) with probability

   p_k = e^{β V_k} / Σ_i e^{β V_i}

 o The agent receives reward R and updates the value function:

   V_n(a_k) <- (1 - α) V_{n-1}(a_k) + α R_{a_k}
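A minimal sketch of this action-value learner: softmax (Boltzmann) action selection over the values, plus the exponential-average update. The fixed payoff vector is a made-up stand-in for the environment:

```python
import math
import random

class ActionValueLearner:
    """Softmax action selection over learned action values."""

    def __init__(self, n_actions, alpha=0.1, beta=2.0):
        self.values = [0.0] * n_actions  # V_k for each action a_k
        self.alpha = alpha               # learning rate
        self.beta = beta                 # softmax temperature parameter

    def select_action(self):
        # p_k = e^{beta V_k} / sum_i e^{beta V_i}
        weights = [math.exp(self.beta * v) for v in self.values]
        r = random.random() * sum(weights)
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                return k
        return len(weights) - 1

    def update(self, action, reward):
        # V_n(a_k) <- (1 - alpha) V_{n-1}(a_k) + alpha R
        self.values[action] = (1 - self.alpha) * self.values[action] + self.alpha * reward

# Toy loop: action 2 pays best, so its value (and selection probability) grows.
random.seed(0)
learner = ActionValueLearner(n_actions=3)
payoffs = [0.1, 0.3, 0.9]  # hypothetical fixed payoffs per action
for _ in range(500):
    a = learner.select_action()
    learner.update(a, payoffs[a])
print(learner.values)
```

Because β weights the values exponentially, higher-valued actions are sampled more often while lower-valued ones are still explored occasionally; α trades off how quickly the value estimate tracks recent rewards.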
Value Updates and Temporal Difference Learning
Value update toward the actual reward r(s_t), where V(s_t) is the current estimate:

V(s_t) <- V(s_t) + α (r(s_t) - V(s_t))

Temporal difference learning replaces the reward sample with the actual reward plus the discounted estimate of the next state's value:

V(s_t) <- V(s_t) + α (r(s_t) + γ V(s_{t+1}) - V(s_t))
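The TD update above can be sketched on a hypothetical three-state chain (the states, rewards, and terminal structure are invented for illustration):

```python
# TD(0) value estimation on a tiny deterministic chain:
# state 0 -> 1 -> 2 (terminal), with reward 1.0 on entering the terminal state.
alpha, gamma = 0.1, 0.9
V = [0.0, 0.0, 0.0]          # value estimates; V[2] stays 0 (terminal)
rewards = {0: 0.0, 1: 1.0}   # r(s_t): reward received when leaving state s_t

for _ in range(1000):        # many episodes through the chain
    for s in (0, 1):
        s_next = s + 1
        # V(s_t) <- V(s_t) + alpha * (r(s_t) + gamma * V(s_{t+1}) - V(s_t))
        V[s] += alpha * (rewards[s] + gamma * V[s_next] - V[s])

print(V)
```

The fixed point is V[1] = 1.0 and V[0] = γ · V[1] = 0.9: the value of each state bootstraps from the state after it, which is exactly what the TD target r + γV(s_{t+1}) expresses.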
Sarsa
Learning Q values for state-action pairs: Sarsa learning is temporal difference learning extended to (s, a):

Q(s_t, a_t) <- Q(s_t, a_t) + α (r(s_t) + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

Here r(s_t) is the actual reward and Q(s_{t+1}, a_{t+1}) is the value of the next state-action pair. The general pattern is:

NewEstimate <- OldEstimate + StepSize (Target - OldEstimate)

Q-Learning
What if we update Q values without using the policy? Q-learning:

Q(s_t, a_t) <- Q(s_t, a_t) + α (r(s_t) + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t))

Here r(s_t) is the actual reward and the max term is the estimate of future reward.
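A minimal sketch of the Q-learning update, run with ε-greedy exploration on a made-up two-state, two-action environment (the dynamics below are invented purely to exercise the rule):

```python
import random

# In state 0, action 1 moves to state 1; in state 1, action 1 earns reward 1.0
# and resets to state 0. Every other action pays nothing and stays put.
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[state][action]

def step(s, a):
    """Toy environment dynamics: return (next_state, reward)."""
    if s == 0 and a == 1:
        return 1, 0.0
    if s == 1 and a == 1:
        return 0, 1.0
    return s, 0.0

random.seed(0)
s = 0
for _ in range(5000):
    # Epsilon-greedy behavior policy: mostly greedy, occasionally random.
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = max((0, 1), key=lambda x: Q[s][x])
    s_next, r = step(s, a)
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next

print(Q)
```

Note the update uses max over the next state's actions, not the action the behavior policy actually takes next; that is what makes Q-learning off-policy, in contrast to Sarsa, which would plug in Q(s_{t+1}, a_{t+1}) for the action actually chosen.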
Q-Learning
What if we update Q values without using the policy? Q-learning:

Q(s_t, a_t) <- Q(s_t, a_t) + α (r(s_t) + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t))

Here r(s_t) is the actual reward and max_a Q(s_{t+1}, a) is the best possible value of the next state-action pair.
Q-learning is policy independent:
 o The Q update is based on the best possible move
 o The Q update does not depend on the action actually taken

Learning Agents: Neural Networks
A simple neural network for an agent:
 o The agent has N actions
 o The agent has to map a set of observations (other agents' actions, past history) to an action
Use a teacher to learn the weights. At each time step:
 o Take an action
 o Compare the result to the teacher's suggested action
 o Update the weights so the resulting action is closer to the teacher's
Or use a search algorithm to learn the weights. At each time step:
1. Start with initial random networks
2. Select a network (90% best, 10% random)
3. Perturb the weights (mutation)
4. Use the network to select an action
5. Evaluate system performance
6. Drop the worst network from the pool; go to step 2
Neuro-Control
1. At t = 0, initialize N neural networks
2. Pick a network using an ε-greedy algorithm (ε = 0.1)
3. Randomly modify the network parameters
4. Use the network on this agent for T >> t steps
5. Evaluate the network's performance R
6. Re-insert the network into the pool
7. Remove the worst network from the pool
8. Go to step 2
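The loop above can be sketched as follows. The "network" is just a weight vector and the evaluation function is an invented stand-in; a real neuro-controller would run the network on the task for T steps to obtain R:

```python
import random

# Evolutionary neuro-control loop: keep a pool of candidate controllers,
# epsilon-greedily pick one, mutate it, evaluate it, and cull the worst.
random.seed(0)
N, EPSILON, SIGMA = 10, 0.1, 0.1

def make_network():
    """Stand-in 'network': a small random weight vector."""
    return [random.uniform(-1, 1) for _ in range(4)]

def evaluate(weights):
    """Stand-in performance R: peaks when every weight equals 0.5."""
    return -sum((w - 0.5) ** 2 for w in weights)

# Step 1: initialize N networks, paired with their scores.
pool = [(evaluate(w), w) for w in (make_network() for _ in range(N))]

for _ in range(500):
    # Step 2: epsilon-greedy selection, usually the best, sometimes random.
    if random.random() < EPSILON:
        _, parent = random.choice(pool)
    else:
        _, parent = max(pool, key=lambda p: p[0])
    # Step 3: perturb the parameters (mutation).
    child = [w + random.gauss(0, SIGMA) for w in parent]
    # Steps 4-5: run the network for an episode and score it.
    score = evaluate(child)
    # Steps 6-7: re-insert into the pool, then drop the worst network.
    pool.append((score, child))
    pool.remove(min(pool, key=lambda p: p[0]))

best_score, best = max(pool, key=lambda p: p[0])
print(best_score, best)
```

Because the pool always discards its worst member, performance never regresses, and the ε-greedy choice of parent balances exploiting the current best controller against occasionally mutating a weaker one.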