COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati


Today
- What is machine learning?
- Where is it used?
- Types of machine learning algorithms:
  - Supervised learning
  - Unsupervised learning
  - Reinforcement learning

What is machine learning?
- Algorithms that enable agents to improve their behavior with experience.
- Improve: there is a performance measure.
- Experience: feedback / observations the agent perceives.
- Components to consider:
  - Performance measure.
  - Representation: Bayesian network, real-valued functions, etc.
  - Type of feedback: win/lose at the end of a game, immediate outcome, etc.

Where is it used?
- Lots of places:
  - Games: Go, Dota, etc.
  - Legal: trademark, logo, etc. (e.g., https://trademark.vision)
  - Medical: prosthetics, reducing tremor in surgeons, etc.
  - Many more.

Types of machine learning
- Supervised learning:
  - Learn from examples.
  - Sometimes we are not sure whether the examples are correct or not; this setting is often called semi-supervised learning.
  - Becoming more popular with crowdsourcing (e.g., Mechanical Turk).
- Unsupervised learning:
  - Learn by finding structure in data.
- Reinforcement learning:
  - Learn by doing.
- Unlike reinforcement learning, the first two require training data.

Reinforcement Learning
- What is reinforcement learning?
- Methods for solving it.

An Illustration of A Reinforcement Learning Agent

Formally,
- A reinforcement learning problem is an MDP where the transition and/or reward functions are not known:
  - S: state space.
  - A: action space.
  - T: transition function, T(s, a, s') = P(S_{t+1} = s' | S_t = s, A_t = a).
  - R: reward function, R(s, a).
  - At least one of T and R is not known.
- The agent needs to try actions and explore the environment (see the sketch below).
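To make the setting concrete, here is a minimal sketch of the interaction loop: the environment knows T and R internally, but the agent only ever observes (s, a, r, s') samples by acting. The toy 2-state environment and its dynamics below are hypothetical, not from the lecture.

```python
import random

class UnknownMDP:
    """Toy 2-state environment; the transition/reward rules are hidden from the agent."""
    def __init__(self):
        self.actions = ["stay", "move"]
        self.state = 0

    def step(self, action):
        # Hidden dynamics: "move" switches state 90% of the time, "stay" never does.
        if action == "move" and random.random() < 0.9:
            next_state = 1 - self.state
        else:
            next_state = self.state
        reward = 1.0 if next_state == 1 else 0.0   # hidden reward function
        self.state = next_state
        return reward, next_state

env = UnknownMDP()
s = env.state
data = []
for _ in range(5):
    a = random.choice(env.actions)     # the agent must try & explore actions
    r, s_next = env.step(a)
    data.append((s, a, r, s_next))     # the only information available for learning
    s = s_next
print(data)
```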

Solving?
- Solving a reinforcement learning problem means computing the best action to perform (recall the MDP definition of best action), even though the transition & reward functions are not known a priori.
- In general, solving boils down to the exploration vs. exploitation problem:
  - Should the agent explore less-known states and actions that might generate better values, OR
  - Should the agent exploit states and actions that it has tried before and that have definitely generated good values?

Exploration vs Exploitation
- Simplest setting: the multi-armed bandit.
  - Simplest in the sense that it assumes only 1 state.
  - We have touched on this a bit when combining sampling strategies in sampling-based motion planning, and when choosing which sampling strategy to use for tree expansion in MCTS.
- Epsilon-greedy
- EXP3
- UCB

Epsilon-greedy
- Assign a weight to each sampling strategy; start with equal weight for all strategies.
- The strategy with the highest weight is selected with probability (1 - ε); the rest are selected with probability ε/N, where N is the number of strategies available.
- Example (from sampling-based motion planning): suppose strategy s1 is selected; we use s1 to sample and add a vertex and edges to the roadmap. If the addition connects disconnected components of the roadmap OR increases the number of connected components of the roadmap, increment the utility of s1 by 1.
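A minimal sketch of epsilon-greedy selection follows. The success probabilities and the "+1 utility" feedback are hypothetical stand-ins for the roadmap-connectivity test on the slide, and the exploration step here (pick uniformly at random with probability ε) is one common variant of the probability split described above.

```python
import random

def epsilon_greedy(weights, epsilon=0.1):
    """Return the index of the strategy to use this round."""
    if random.random() < epsilon:
        return random.randrange(len(weights))            # explore: pick uniformly
    best = max(weights)
    return random.choice([i for i, w in enumerate(weights) if w == best])  # exploit

weights = [1.0, 1.0, 1.0]          # start with equal weight for all strategies
success_prob = [0.2, 0.5, 0.8]     # hypothetical "improves the roadmap" probabilities
for _ in range(200):
    i = epsilon_greedy(weights)
    if random.random() < success_prob[i]:
        weights[i] += 1            # increment the utility of the selected strategy
print(weights)                     # strategy 2 should dominate after enough rounds
```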

EXP3
- Guarantees performance at least competitive with the best single strategy.
- Each strategy s is selected with a probability based on its weight w_s(t) and the sampling history:
  p_s(t) = (1 - η) · w_s(t) / Σ_{s'} w_{s'}(t) + η / K
  where K is the number of strategies (samplers) and η controls the amount of uniform exploration.
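Below is a minimal sketch of EXP3. The selection probability matches the formula above; the importance-weighted update of w_s(t) is the standard one from the EXP3 literature and is assumed here, since the slide only shows the selection probability. The reward function is hypothetical.

```python
import math
import random

def exp3(reward_fn, K, eta=0.1, rounds=1000):
    w = [1.0] * K                                             # one weight per strategy
    for _ in range(rounds):
        total = sum(w)
        p = [(1 - eta) * w_s / total + eta / K for w_s in w]  # p_s(t) from the slide
        s = random.choices(range(K), weights=p)[0]            # sample a strategy
        x = reward_fn(s)                                      # observed reward in [0, 1]
        x_hat = x / p[s]                                      # importance-weighted estimate
        w[s] *= math.exp(eta * x_hat / K)                     # exponential weight update
    return w

# Hypothetical rewards: strategy 2 succeeds most often.
weights = exp3(lambda s: 1.0 if random.random() < [0.2, 0.5, 0.8][s] else 0.0, K=3)
print(weights)
```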

Upper Confidence Bound
- Choose the action a to perform at s as:
  a* = argmax_a [ Q(s, a) + c · sqrt( ln n(s) / n(s, a) ) ]
  where Q(s, a) is the current value estimate (the exploitation term) and the square-root term is the exploration bonus (the standard UCB1 form).
- c: a constant indicating how to balance exploration & exploitation; it needs to be decided by trial & error.
- n(s): number of times node s has been visited.
- n(s, a): number of times the out-edge of s with label a has been visited.
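A minimal sketch of this UCB1-style selection, as used for tree expansion in MCTS, is below. The value estimates and counts are hypothetical placeholders; in practice Q(s, a) would be the average return of the simulations through edge (s, a).

```python
import math

def ucb_select(Q, n_sa, n_s, c=1.4):
    """Q[a]: value estimate; n_sa[a]: visits of edge (s, a); n_s: visits of node s."""
    best_a, best_score = None, float("-inf")
    for a in Q:
        if n_sa[a] == 0:
            return a                                   # try every action at least once
        score = Q[a] + c * math.sqrt(math.log(n_s) / n_sa[a])   # exploitation + exploration
        if score > best_score:
            best_a, best_score = a, score
    return best_a

Q = {"left": 0.4, "right": 0.6}
n_sa = {"left": 10, "right": 3}
print(ucb_select(Q, n_sa, n_s=13))    # "right" wins: higher value and fewer visits
```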

Exploration vs Exploitation
- Simplest setting: the multi-armed bandit.
  - Simplest in the sense that it assumes only 1 state.
  - We have touched on this a bit when combining sampling strategies in sampling-based motion planning, and when choosing which sampling strategy to use for tree expansion in MCTS.
- Epsilon-greedy
- EXP3
- UCB

More general approaches for solving RL
- Data from interacting with the world: <s, a, r, s'>.
- Model-based vs model-free: what is being learned?
- Passive vs active: how is the data being generated?
- Together these give a 2x2 grid of approaches: {passive, active} x {model-based, model-free}.

Model-based vs model-free
- Model-based:
  - Use data to learn the missing components of the MDP problem, i.e., T & R.
  - Once we know T & R, solve the MDP problem.
  - Indirect learning, but the most efficient use of data.
- Model-free:
  - Use data to learn the value function & policy directly.
  - Direct learning; not the most efficient use of data, but can sometimes be fast.

Passive vs Active
- Passive:
  - Fixed policy.
  - The agent observes the world while following the policy, OR, given a data set (e.g., from video), the agent observes it to learn the value function or the model (transition and reward functions).
  - The data-set variant was introduced relatively recently.
- Active:
  - The classical reinforcement learning problem.
  - The agent selects which action to perform, and the actions performed determine the data it receives, which in turn determines how fast the agent converges to the correct MDP model.
  - Exploration vs exploitation.
- Combinations of the two also exist.

Model-based Approach: Overview
- Two steps, run iteratively. Loop over:
  - Learning step: use data to estimate T & R.
  - Solving step: solve the MDP as if the learned model is correct, using the methods for solving MDPs we have discussed before.

Model-based: Simple Frequentist
- The learning step uses counting to estimate T & R:
  - T^(s, a, s') = (#data where the agent ends up in s' after performing a from s) / (#data where the agent performs a from s).
  - If R(s, a) has a finite & known range, we can estimate it by counting as well, in the same way as T(s, a, s').
  - If R(s, a) has an infinite / unknown range, we can fit a function to the data (using methods from regression / supervised learning).
- This simple strategy will converge to the true values (a sketch follows below).
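Here is a minimal sketch of the counting estimate of T and R from a batch of (s, a, r, s') samples. The sample data at the bottom is hypothetical.

```python
from collections import defaultdict

def estimate_model(data):
    n_sa = defaultdict(int)        # N(s, a): times a was performed from s
    n_sas = defaultdict(int)       # N(s, a, s'): times that led to s'
    r_sum = defaultdict(float)     # total reward observed at (s, a)
    for s, a, r, s_next in data:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s_next)] += 1
        r_sum[(s, a)] += r
    T_hat = {(s, a, s2): cnt / n_sa[(s, a)] for (s, a, s2), cnt in n_sas.items()}
    R_hat = {sa: total / n_sa[sa] for sa, total in r_sum.items()}   # mean reward
    return T_hat, R_hat

data = [(0, "move", 0.0, 0), (0, "move", 1.0, 1), (0, "move", 1.0, 1),
        (1, "stay", 1.0, 1)]
T_hat, R_hat = estimate_model(data)
print(T_hat)   # e.g. T_hat[(0, 'move', 1)] = 2/3
print(R_hat)
```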

Model-based, Passive
1. Estimate the transition function T & reward function R from data; (modified) supervised learning methods can be used for this.
2. Solve the MDP (generate an optimal policy) as if the learned model is correct, using methods for solving MDPs as discussed before (a sketch of this step follows below).
3. Compute the difference between the data and the trajectories that would be generated if this optimal policy were executed.
4. If the difference is large:
   - Improve the MDP model based on the above difference.
   - Repeat from step 2.
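As a sketch of step 2, the learned model can be handed to plain value iteration, treating the estimates as if they were correct. The dictionaries below use the same format as the counting sketch above, and the numbers in them are hypothetical.

```python
def value_iteration(states, actions, T_hat, R_hat, gamma=0.95, eps=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_best = max(
                R_hat.get((s, a), 0.0)
                + gamma * sum(T_hat.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                for a in actions)
            delta = max(delta, abs(q_best - V[s]))
            V[s] = q_best
        if delta < eps:
            break
    policy = {s: max(actions, key=lambda a: R_hat.get((s, a), 0.0)
              + gamma * sum(T_hat.get((s, a, s2), 0.0) * V[s2] for s2 in states))
              for s in states}
    return V, policy

T_hat = {(0, "move", 1): 0.9, (0, "move", 0): 0.1, (0, "stay", 0): 1.0,
         (1, "stay", 1): 1.0, (1, "move", 0): 0.9, (1, "move", 1): 0.1}
R_hat = {(0, "stay"): 0.0, (0, "move"): 0.0, (1, "stay"): 1.0, (1, "move"): 0.0}
V, policy = value_iteration([0, 1], ["stay", "move"], T_hat, R_hat)
print(policy)   # expected: move out of state 0, stay in state 1
```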

More general approaches for solving RL
- Data from interacting with the world: <s, a, r, s'>.
- Model-based vs model-free: what is being learned?
- Passive vs active: how is the data being generated?
- Together these give a 2x2 grid of approaches: {passive, active} x {model-based, model-free}.

Model-based, Active
- We need a way to decide which data to use.
- Classic: interact with the world directly.
  - Decide the actions used to interact with the world so as to balance gaining information and reaching the goal.
- Nowadays: trials can be performed in a high-fidelity simulator.
  - Decide the actions used to interact with the simulated world (the hope, of course, is that the simulation is close to reality), again balancing gaining information and reaching the goal.
  - Need to consider how well the learned behavior transfers from the simulator to the real world.

Bayesian Reinforcement Learning
- Bayesian view:
  - The parameters (T & R) we want to estimate are represented as random variables.
  - Start with a prior over models; compute the posterior based on data (a sketch follows below).
- Quite useful when the agent actively gathers data:
  - It can decide how to balance exploration & exploitation, i.e., how to improve the model & solve the problem optimally.
- Often represented as a Partially Observable Markov Decision Process (POMDP).
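A minimal sketch of the prior-to-posterior update for the transition function is below. It keeps a Dirichlet prior over the next-state distribution of each (s, a) and updates its counts as data arrives; the Dirichlet/count representation is a standard choice assumed here, since the slide does not prescribe a particular prior, and the sample data is hypothetical.

```python
from collections import defaultdict

class DirichletTransitionModel:
    def __init__(self, states, prior=1.0):
        self.states = states
        self.prior = prior                      # pseudo-count added to every s'
        self.counts = defaultdict(float)        # observed counts for (s, a, s')

    def update(self, s, a, s_next):
        self.counts[(s, a, s_next)] += 1.0      # posterior = prior + data counts

    def posterior_mean(self, s, a):
        """Posterior mean estimate of T(s, a, .)."""
        total = sum(self.counts[(s, a, s2)] for s2 in self.states) \
                + self.prior * len(self.states)
        return {s2: (self.counts[(s, a, s2)] + self.prior) / total
                for s2 in self.states}

model = DirichletTransitionModel(states=[0, 1])
for s, a, s_next in [(0, "move", 1), (0, "move", 1), (0, "move", 0)]:
    model.update(s, a, s_next)
print(model.posterior_mean(0, "move"))   # shrinks toward uniform when data is scarce
```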

Bayesian Reinforcement Learning
- The problem of solving an MDP with unknown T & R can be represented as a POMDP in which the MDP model is partially observed.
- POMDP model:
  - S: MDP states × T × R (the hidden state includes the unknown model parameters).
  - A: MDP actions.
  - T(s, a, s'): the transition, assuming the MDP model is as described by POMDP state s.
  - Ω: the resulting next state and reward of the MDP.
  - Z(s, a, o): the perceived next state & reward, assuming the MDP model is as described by POMDP state s.
  - R(s, a): the reward, assuming the MDP model is as described by POMDP state s.

Bayesian Reinforcement Learning
- The optimal policy of this POMDP is the optimal exploration vs exploitation trade-off:
  - It balances building the most accurate model & working directly towards achieving the goal.
  - It makes the MDP agent receive the maximum reward given the initially unknown T & R.
  - Building the best model is just an intermediate step, not the end goal!

More general approaches for solving RL
- Data from interacting with the world: <s, a, r, s'>.
- Model-based vs model-free: what is being learned?
- Passive vs active: how is the data being generated?
- Together these give a 2x2 grid of approaches: {passive, active} x {model-based, model-free}.