Short Course: Multiagent Systems
Lecture 1: Basics, Agents, Environments, Reinforcement Learning


Multiagent Systems

This course is about:
- Agents: sensing, reasoning, acting
- Multiagent systems: distributed control
- Dealing with systems with interactions
  o Cooperative: collectives, swarms
  o Competitive: game theory, auctions
- Doing research
  o Literature review
  o Gap identification
  o Research
  o Results

Motivation: Why Study Multiagent Systems?

Current trends:
- Systems are becoming more interconnected
  o Larger, more distributed, more stochastic
- Hybrid systems are emerging
  o Biological/nano/electronic systems
- Computation is entering new niches
  o More powerful, cheaper, smaller devices

We need new approaches to optimization and control for:
- Thousands of components
- Failing components
- Dynamic and stochastic environments
- Hierarchical/hybrid systems

Applications: Think Big and Small
- Controlling multiple autonomous vehicles
- Managing traffic congestion
- Routing data over a network
- Controlling constellations of satellites
- Coordinating thousands of simple devices
- Managing power distribution
- Stabilizing wings with tiny flaps
- Flying in formation
- Morphing matter: smart structures
- Coordinating micro air vehicles
- Managing system health
- Controlling nano/micro devices
- Managing air traffic flow

Agents

Two definitions:
- An agent is a computer system that is capable of independent (autonomous) action.
  o Autonomous: figures out what needs to be done
  o How to select actions
  o How to evaluate actions
- An agent is a computer system that senses, reasons about, and acts in an environment.
  o Sensing
  o Reasoning
  o Acting
  o Environment

Intelligence
- The complexity of the tasks that we automate has grown; a lot of what we take for granted today would have been viewed as AI ten years ago. The autopilot of a 747? Deep Blue? Internet searches?
- Intelligence seems like something off in the distance, but in reality it is in many everyday products.
- My definition: an agent that senses the world, reasons about the world, acts within that world, and learns from its interaction with the world is an intelligent agent.

Multiagent Systems

Definition: a multiagent system is a system that consists of multiple agents that interact with one another and with the environment.
  o Multiple
  o Agents
  o Interact
  o Environment

Objections to multiagent systems: isn't it just...
- Distributed systems?
- AI?
- Game theory?
- Economics/mechanism design?
- Social science?
- Biological/ecological modeling?

Environment
- Accessible vs. inaccessible
- Deterministic vs. non-deterministic
- Episodic vs. non-episodic
- Static vs. dynamic
- Discrete vs. continuous

Properties of the environment: accessible vs. inaccessible
- Accessible: agents can obtain complete, accurate, up-to-date information about the environment.
- Most environments of interest are inaccessible.
  o For example, anything operating in the real world.

Properties of the environment: deterministic vs. non-deterministic
- Deterministic: an action taken by an agent has a predictable consequence; there is no uncertainty about the outcome of an action.
- Most environments of interest are non-deterministic.
  o For example, most things operating in the real world.

Properties of the environment: episodic vs. non-episodic
- Episodic: the agent acts for a fixed number of time steps and then the world is reset; there is no link between episodes.
- Non-episodic: the agent operates continuously in an environment.
  o Some real-world problems are episodic: games.
  o Other real-world problems are non-episodic: exploring a terrain.
  o Some problems can be viewed as either: a robot trying to find a target in a room.

Properties of the environment: static vs. dynamic
- Static: the environment stays the same except for changes caused by the agent's actions.
  o A robot aims to detect fixed goals in an arena.
- Dynamic: the environment in which the agent operates changes.
  o Goals that the robot needs to detect appear, disappear, or move.

Properties of the environment: discrete vs. continuous
- Discrete: there is a fixed, finite number of actions.
  o A game of chess is discrete (discrete does not mean easy).
- Continuous: there is an infinite number of actions.
  o Autonomous vehicle control is continuous (in the most general case): direction angle, speed.

Reactive Agents
- A reactive agent is one that interacts with a changing environment and responds to changes that occur in that environment.
- Fixed responses
- Learning systems?

Goal-Directed Agents
- We design agents to do something. That something has to be expressed to the agent as a utility/objective/reward/payoff function.

Goal-Directed Agents
- We design agents to do something. That something has to be expressed to the agent as a utility/objective/reward/payoff function.
- A goal-directed agent is one that acts to achieve a specified goal.
- Fixed responses? Learning systems?

Deductive/Inductive Reasoning
- Deductive reasoning example:
  o Rover is a dog.
  o All dogs are mammals.
  o Rover is a mammal.
- Inductive reasoning example:
  o Ellen is a student.
  o Most students study.
  o Ellen must study.

Brooks and the Subsumption Architecture (Rodney Brooks)

Three key statements:
- Intelligent behavior can be generated without explicit representations.
- Intelligent behavior can be generated without explicit abstract reasoning.
- Intelligent behavior is an emergent property of certain types of complex systems.

Two key ideas:
- Situatedness and embodiment: real intelligence is situated in the world, not in disembodied systems such as theorem provers or expert systems.
- Intelligence and emergence: intelligent behavior arises as a result of an agent's interaction with its environment.
(From Wooldridge, An Introduction to Multiagent Systems, chap. 5)

Example: Bar Problem
- Congestion game: a game where agents share the same action space and the system objective is a function purely of how many agents take each action.
- Illustrative example, Arthur's El Farol bar problem. At each time step, each agent decides whether to attend a bar:
  o If the agent attends and the bar is below capacity, the agent gets a reward.
  o If the agent stays home and the bar is above capacity, the agent gets a reward.
- The problem is particularly interesting because rational agents cannot all correctly predict attendance:
  o If most agents predict attendance will be low and therefore attend, attendance will be high.
  o If most agents predict attendance will be high and therefore do not attend, attendance will be low.

Example: Bar Problem
- The bar owner has a utility function that it wants to maximize. For each day:
  o Maximal revenue if the bar is at capacity c.
  o If more crowded than c, revenue drops off and extra costs kick in.
  o If less crowded than c, there is not enough revenue.
- The bar owner wants to maximize total revenue for the week.
- The problem is one of a family of congestion games in which the value of a resource (a night at the bar) depends on the number of people using it:
  o Bar --> highway; night --> lane; barkeep --> city manager
  o Bar --> server farm; night --> server; barkeep --> IT support

Modified El Farol Bar Problem
- Each week, agents select one of N nights to attend the bar.
- Reward for all nights:

  G(z) = \sum_{k=1}^{N} x_k(z) \, e^{-x_k(z)/c}

  where x_k(z) is the attendance on night k, c is the capacity of the bar, and each term x_k(z) e^{-x_k(z)/c} is the reward for night k.
- Further modifications: each week each agent selects two nights to attend the bar, ..., each week each agent selects six nights to attend the bar.
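To make the shape of G concrete, here is a minimal sketch of the night-by-night world reward under the formula above; the attendance vector and capacity used in the example call are made up for illustration.

```python
import numpy as np

def world_reward(attendance, capacity):
    """World reward G(z) for the modified El Farol bar problem.

    attendance: vector x_k(z), number of agents attending each night k.
    capacity:   bar capacity c.
    Each night contributes x_k * exp(-x_k / c), which peaks when attendance
    equals the capacity and falls off when the bar is over- or under-crowded.
    """
    attendance = np.asarray(attendance, dtype=float)
    return float(np.sum(attendance * np.exp(-attendance / capacity)))

# Illustrative only: 7 nights, capacity 6, uneven attendance.
print(world_reward([10, 2, 8, 5, 6, 4, 7], capacity=6))
```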

Properties of agents
- Input but no state (purely reactive):
  o No state information; decisions are based solely on the present input.
  o Bar problem: go the same day as Joe and Jane.
- State but no input (purely history based):
  o Build internal state based on past observations; decisions are based on past successes, with no input.
  o Bar problem: go on the day that got the best reward over the last n weeks.
- State and input:
  o Build internal state based on past observations and use additional input to predict likely outcomes.
  o Bar problem: go on the day that got the most reward and best matches what Joe and Jane are doing this time step.

Focus on two types of learning agents:
- Reinforcement learning agents
  o Learning automata
  o Action value
  o Value iteration
  o Policy iteration
  o Temporal difference learning
  o Q-learning
- Neuro-control agents
  o A neural network maps inputs to outputs.
  o Weights adapt to minimize an objective function.
  o Needs a teacher or a search algorithm (simulated annealing, evolutionary algorithms).

Reinforcement Learning: Concept
- Learn from interactions with the environment (see the sketch below):
  o Take an action.
  o Receive feedback from the environment.
  o Modify your behavior.
  o Achieve some goal.
- Examples:
  o A baby playing: connection to the environment guides sensory input/output.
  o Driving a car: learn from interaction.

Agent-Environment Interaction in RL
[Figure: interaction loop. The agent observes state s_t and reward r_{t-1}, takes action a_t, and the environment returns reward r_t and next state s_{t+1}.]
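As a minimal, illustrative sketch of this interaction loop (assuming a hypothetical environment with reset() and step() methods and an agent with act() and learn() methods; these names are placeholders, not a specific library API):

```python
def run_episode(agent, env, max_steps=1000):
    """Generic agent-environment interaction loop (hypothetical interfaces)."""
    s = env.reset()                      # initial state s_0
    for t in range(max_steps):
        a = agent.act(s)                 # take action a_t in state s_t
        s_next, r, done = env.step(a)    # environment returns r_t and s_{t+1}
        agent.learn(s, a, r, s_next)     # modify behavior using the feedback
        s = s_next
        if done:
            break
```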

Agent-Environment Interaction in RL
[Figure repeated: the agent observes s_t, takes action a_t, and receives r_t and s_{t+1} from the environment.]

The policy π_t maps states to probabilities of taking actions: π_t(s, a) = P(a_t = a | s_t = s).

Learning Agents: RL
A simple reinforcement learner (action value) for an agent:
- The agent has N actions.
- The agent has a value V_k associated with each action a_k.
- At each time step (a sketch follows below):
  o The agent takes action a_k (for the n-th time) with probability

    p_k = \frac{e^{\beta V_k}}{\sum_i e^{\beta V_i}}

  o The agent receives reward R and updates the value function:

    V_n(a_k) \leftarrow (1 - \alpha) V_{n-1}(a_k) + \alpha R_{a_k}
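A minimal sketch of this action-value learner with Boltzmann (softmax) action selection; the temperature β, learning rate α, number of actions, and the stand-in reward signal are illustrative choices, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 7            # e.g., one action per night in the bar problem
BETA = 2.0               # Boltzmann temperature parameter
ALPHA = 0.1              # learning rate
V = np.zeros(N_ACTIONS)  # value V_k for each action a_k

def select_action(V):
    """Sample a_k with probability p_k = exp(beta*V_k) / sum_i exp(beta*V_i)."""
    prefs = np.exp(BETA * (V - V.max()))   # subtract max for numerical stability
    p = prefs / prefs.sum()
    return rng.choice(len(V), p=p)

def update_value(V, action, reward):
    """V_n(a_k) <- (1 - alpha) * V_{n-1}(a_k) + alpha * R."""
    V[action] = (1 - ALPHA) * V[action] + ALPHA * reward

for t in range(1000):
    a = select_action(V)
    R = rng.normal(loc=a / N_ACTIONS, scale=0.1)  # stand-in reward signal
    update_value(V, a, R)
```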

Value Updates and Temporal Difference Learning

Value update (r(s_t) is the actual reward; V(s_t) is the current estimate of the reward):

  V(s_t) \leftarrow V(s_t) + \alpha [ r(s_t) - V(s_t) ]

Temporal difference learning replaces the reward target with the reward plus the discounted value of the next state:

  V(s_t) \leftarrow V(s_t) + \alpha [ ( r(s_t) + \gamma V(s_{t+1}) ) - V(s_t) ]
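A minimal sketch of this TD update on a table of state values; the learning rate, discount factor, and example transition are illustrative:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update on a dict of state values:
    V(s_t) <- V(s_t) + alpha * [ r(s_t) + gamma * V(s_{t+1}) - V(s_t) ]."""
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (target - v_s)

# Example: a single observed transition s0 -> s1 with reward 1.0.
V = {}
td0_update(V, s="s0", r=1.0, s_next="s1")
print(V)   # {'s0': 0.1} with alpha = 0.1
```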

Sarsa Learning
Q values are learned for state-action pairs; Sarsa learning is temporal difference learning extended to (s, a). The target uses the actual reward and the value of the next state-action pair:

  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r(s_t) + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]

The general pattern is: NewEstimate \leftarrow OldEstimate + StepSize (Target - OldEstimate).

Q-Learning
What if we update Q values without using the policy? Q-learning uses the actual reward plus an estimate of future reward:

  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r(s_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) ]
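The two updates differ only in the target; a minimal tabular sketch (the action set, learning rate, and discount factor are illustrative placeholders):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
ACTIONS = ["left", "right"]   # placeholder action set
Q = defaultdict(float)        # Q[(state, action)] -> value estimate

def sarsa_update(s, a, r, s_next, a_next):
    """Sarsa: the target uses the action actually taken in s_{t+1}."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next):
    """Q-learning: the target uses the best action in s_{t+1}, regardless of the policy followed."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```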

Q-Learning
What if we update Q values without using the policy? Q-learning:

  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r(s_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) ]

The target uses the actual reward and the best possible value of the next state-action pair. The update is policy independent: it is based on the best possible move, not on the action actually taken.

Learning Agents: Neural Networks
A simple neural network for an agent:
- The agent has N actions.
- The agent has to map a set of observations (other agents' actions, past history) to an action.
- Option 1: use a teacher to learn the weights. At each time step:
  o Take an action.
  o Compare the result to the teacher's suggested action.
  o Update the weights so the resulting action is closer to the teacher's.
- Option 2: use a search algorithm to learn the weights. At each time step:
  1. Start with initial random networks.
  2. Select a network (90% best, 10% random).
  3. Perturb the weights (mutation).
  4. Use the network to select an action.
  5. Evaluate system performance.
  6. Drop the worst network from the pool; go to step 2.

Neuro-Control
1. At t = 0, initialize N neural networks.
2. Pick a network using an ε-greedy algorithm (ε = 0.1).
3. Randomly modify the network parameters.
4. Use the network on this agent for T >> t steps.
5. Evaluate the network's performance R.
6. Re-insert the network into the pool.
7. Remove the worst network from the pool.
8. Go to step 2.

A sketch of this loop follows below.
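A minimal sketch of the evolutionary neuro-control loop above, assuming single-layer weight matrices and a placeholder evaluation function standing in for steps 4-5 (running the network on the agent for T steps):

```python
import numpy as np

rng = np.random.default_rng(0)
N_NETS, N_IN, N_OUT = 10, 4, 3
EPSILON, SIGMA = 0.1, 0.05

# Step 1: at t = 0, initialize N networks (here, single weight matrices).
pool = [{"w": rng.normal(size=(N_IN, N_OUT)), "fitness": -np.inf}
        for _ in range(N_NETS)]

def evaluate(weights):
    """Placeholder for steps 4-5: use the network on the agent for T steps
    and return its performance R (a dummy score here)."""
    return -float(np.sum(weights ** 2)) + rng.normal(scale=0.01)

for generation in range(100):
    # Step 2: pick a network epsilon-greedily (90% best, 10% random).
    if rng.random() < EPSILON:
        parent = pool[rng.integers(N_NETS)]
    else:
        parent = max(pool, key=lambda n: n["fitness"])
    # Step 3: perturb (mutate) the weights.
    child_w = parent["w"] + rng.normal(scale=SIGMA, size=parent["w"].shape)
    # Steps 4-5: use the network and evaluate its performance R.
    child = {"w": child_w, "fitness": evaluate(child_w)}
    # Steps 6-7: re-insert the network, remove the worst one from the pool.
    pool.append(child)
    worst = min(range(len(pool)), key=lambda i: pool[i]["fitness"])
    pool.pop(worst)
    # Step 8: go to step 2 (next loop iteration).
```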