Q-learning. Tambet Matiisen

Q-learning. Tambet Matiisen (based on chapter 11.3 of the online book Artificial Intelligence: Foundations of Computational Agents by David Poole and Alan Mackworth)

Stochastic gradient descent, experience replay mechanism, convolutional neural networks, multilayer perceptrons, deep learning, restricted Boltzmann machines, Q-learning, temporal difference learning, reinforcement learning, Markov Decision Processes, Arcade Learning Environment

Reinforcement learning ... is an area of machine learning inspired by behaviorist psychology, concerned with how to take actions in an environment so as to maximize cumulative reward.
The agent learns on the go: it has no prior knowledge of the environment or of which actions result in rewards.
The agent must still be hardwired to recognize the reward.

Markov Decision Processes
$S$ is a finite set of states. $A$ is a finite set of actions.
MDP: $\langle s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, s_4, \dots \rangle$
$P(s,a,s')$ is the probability that action $a$ in state $s$ will lead to state $s'$.
$R(s,a,s')$ is the immediate reward received after the transition to state $s'$ from state $s$ with action $a$.
In reinforcement learning $P(s,a,s')$ and $R(s,a,s')$ are unknown to the agent.

Example 1
States: s0, s1, s2, s3, s4, s5
Actions: right; left; up (0.8 up, 0.1 left, 0.1 right); upC ("up carefully": always up, but costs reward -1)
Additional rewards: out of bounds -1

Example 2
State: <X, Y, P, D>
X, Y: location of the agent
P: location of the prize (0-4)
D: whether the agent is damaged
Actions: left, right, up, down (not reliable)
Rewards: prize +10; out of bounds -1; monster bites damaged agent -10

Why is it hard?
Blame attribution problem
Explore-exploit dilemma

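Temporal Differences
Running average of the values $v_1, \dots, v_k$:
$$A_k = \frac{v_1 + \dots + v_k}{k}$$
$$k A_k = v_1 + \dots + v_{k-1} + v_k = (k-1) A_{k-1} + v_k$$
$$A_k = \left(1 - \frac{1}{k}\right) A_{k-1} + \frac{v_k}{k} = A_{k-1} + \frac{1}{k} (v_k - A_{k-1})$$
Let $\alpha_k = 1/k$, then $A_k = A_{k-1} + \alpha_k (v_k - A_{k-1})$.
$v_k - A_{k-1}$ is called the temporal difference error.
If $\alpha$ is a constant ($0 < \alpha \le 1$) that does not depend on $k$, then later values are weighted more.
Convergence is guaranteed if $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$.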
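The incremental form of the average can be checked numerically. The following is a minimal sketch; the function names `running_average` and `constant_step_average` are illustrative and do not come from the slides.

```python
def running_average(values):
    """Return A_1..A_k computed with A_k = A_{k-1} + (1/k) * (v_k - A_{k-1})."""
    averages = []
    a = 0.0
    for k, v in enumerate(values, start=1):
        alpha = 1.0 / k            # step size alpha_k = 1/k
        a = a + alpha * (v - a)    # (v - a) is the temporal difference error
        averages.append(a)
    return averages


def constant_step_average(values, alpha, initial=0.0):
    """Same update with a constant step size, so later values weigh more."""
    a = initial
    for v in values:
        a = a + alpha * (v - a)
    return a


print(running_average([2, 4, 6]))
print(constant_step_average([0, 0, 10], alpha=0.5))
```

With the step size 1/k the result is the ordinary mean of the values seen so far. With a constant step size the last value dominates, which is useful when the quantity being estimated changes over time.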

Discounted reward
History: $\langle s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, r_4, s_4, \dots \rangle$
Total reward:
$$R = \sum_{t=1}^{\infty} \gamma^{t-1} r_t = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots = r_1 + \gamma (r_2 + \gamma (r_3 + \dots))$$
$\gamma$ is the discount factor ($0 \le \gamma < 1$), which makes future rewards worth less than current ones.
Total reward from time point $t$ onward: $R_t = r_t + \gamma R_{t+1}$
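The recursive form of the discounted reward suggests computing it backwards over a finite reward sequence. This is a small sketch under the assumption that the history is truncated after the listed rewards; `discounted_return` is an illustrative name.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward list."""
    total = 0.0
    # Work backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))
# The direct sum over gamma^(t-1) * r_t gives the same number.
print(sum(0.5 ** t * r for t, r in enumerate(rewards)))
```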

Q-learning definition
$Q^*(s,a)$ is the expected value (cumulative discounted reward) of doing $a$ in state $s$ and then following the optimal policy.
Q-learning uses temporal differences to estimate the value of $Q^*(s,a)$.
The agent maintains a table Q[S, A], where S is the set of states and A is the set of actions.
Q[s, a] represents its current estimate of $Q^*(s,a)$.

Q-learning update rule
Experience $\langle s, a, r, s' \rangle$ provides a new data point to estimate $Q^*(s,a)$: $r + \gamma \max_{a'} Q^*(s',a')$. This is called the return.
The agent can use the temporal difference equation to update its estimate for $Q^*(s,a)$:
$$Q[s,a] \leftarrow Q[s,a] + \alpha \left(r + \gamma \max_{a'} Q[s',a'] - Q[s,a]\right)$$
Compare with $A_k = A_{k-1} + \alpha_k (v_k - A_{k-1})$ and $R_t = r_t + \gamma R_{t+1}$.
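A single update step might look as follows. This is a sketch, not the presenter's code: the table is assumed to be a dictionary keyed by (state, action) pairs with unseen entries defaulting to 0, and `q_update` is an illustrative name.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one Q-learning update for experience <s, a, r, s_next>.

    Returns the temporal difference error for inspection.
    """
    # New data point: reward plus discounted value of the best next action.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    td_error = target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)  # assumption: all estimates start at 0
actions = ["upC", "up", "left", "right"]
q_update(Q, "s0", "upC", -1, "s2", actions, alpha=0.2, gamma=0.9)
print(Q[("s0", "upC")])
```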

Q-learning algorithm
controller Q-learning(S, A, γ, α)
  Inputs
    S is a set of states
    A is a set of actions
    γ the discount
    α is the step size
  Local
    real array Q[S,A]
    previous state s
    previous action a
  initialize Q[S,A] arbitrarily
  observe current state s
  repeat
    select and carry out an action a
    observe reward r and state s'
    Q[s,a] ← Q[s,a] + α(r + γ max_a' Q[s',a'] - Q[s,a])
    s ← s'
  until termination
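The controller can be sketched in Python roughly as follows. Everything about the environment is an assumption made for illustration: a hypothetical five-state corridor (states 0 to 4) where reaching state 4 yields reward +1 and ends the episode. Action selection uses ε-greedy with random tie-breaking, which is one possible choice and not prescribed by the pseudocode.

```python
import random

ACTIONS = ["left", "right"]


def step(state, action):
    """Hypothetical corridor: state 4 is terminal and pays reward +1."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == 4 else 0.0
    return reward, next_state, next_state == 4


def q_learning(episodes=500, gamma=0.9, alpha=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # "initialize Q[S,A] arbitrarily": zeros are used here.
    Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
    for _ in range(episodes):
        s = 0                                   # observe current state
        for _ in range(100):                    # safety cap on episode length
            # Select an action: explore with probability epsilon.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            r, s_next, done = step(s, a)        # observe reward and next state
            target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
            if done:
                break
    return Q


Q = q_learning()
for s in range(4):
    print(s, max(ACTIONS, key=lambda b: Q[(s, b)]))
```

In this toy problem the learned greedy policy moves right in every non-terminal state. The entries for the terminal state are never updated and stay at 0, so the target at the final step reduces to the reward alone.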

Q-learning example (γ=0.9 and α=0.2)

s    a     r      s'   Update
s0   upC   -1     s2   Q[s0,upC] = -0.2
s2   up    0      s4   Q[s2,up] = 0
s4   left  10     s0   Q[s4,left] = 2.0
s0   upC   -1     s2   Q[s0,upC] = -0.36
s2   up    0      s4   Q[s2,up] = 0.36
s4   left  10     s0   Q[s4,left] = 3.6
s0   up    0      s2   Q[s0,up] = 0.06
s2   up    -100   s2   Q[s2,up] = -19.65
s2   up    0      s4   Q[s2,up] = -15.07
s4   left  10     s0   Q[s4,left] = 4.89
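The numbers in this example can be reproduced by replaying the listed experiences through the update rule. The sketch below assumes that all Q-values start at 0 and that the action set is the one from Example 1; both assumptions are consistent with the values on the slide.

```python
from collections import defaultdict

ACTIONS = ["upC", "up", "left", "right"]
GAMMA, ALPHA = 0.9, 0.2

# (s, a, r, s') tuples in the order given in the example.
experiences = [
    ("s0", "upC", -1, "s2"),
    ("s2", "up", 0, "s4"),
    ("s4", "left", 10, "s0"),
    ("s0", "upC", -1, "s2"),
    ("s2", "up", 0, "s4"),
    ("s4", "left", 10, "s0"),
    ("s0", "up", 0, "s2"),
    ("s2", "up", -100, "s2"),
    ("s2", "up", 0, "s4"),
    ("s4", "left", 10, "s0"),
]

Q = defaultdict(float)  # assumption: initial estimates are all 0
trace = []
for s, a, r, s_next in experiences:
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
    trace.append(round(Q[(s, a)], 2))

print(trace)
# [-0.2, 0.0, 2.0, -0.36, 0.36, 3.6, 0.06, -19.65, -15.07, 4.89]
```

Note how the reward of 10 at s4 only reaches Q[s2,up] on the second pass and Q[s0,up] on the third: each update propagates value back by one step.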

Demo time!!!

Exploration vs Exploitation
ε-greedy strategy: choose a random action with probability ε, otherwise choose the action with maximum Q[s,a].
Soft-max action selection: choose action $a$ with a probability depending on the value of Q[s,a], e.g. using the Boltzmann distribution $e^{Q[s,a]/T} / \sum_{a'} e^{Q[s,a']/T}$, where $T$ is the temperature specifying how randomly actions should be chosen.
"Optimism in the face of uncertainty": initialize the Q-function to values that encourage exploration.
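The first two strategies could be sketched as below. The helper names are illustrative, and `q_values` is assumed to be a dictionary mapping each action available in the current state to its Q-value.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=q_values.get)


def boltzmann_probabilities(q_values, temperature):
    """Return action probabilities proportional to exp(Q[s,a] / T)."""
    # Subtracting the maximum leaves the probabilities unchanged
    # but avoids overflow in exp for large Q-values.
    top = max(q_values.values())
    weights = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


q_values = {"left": 1.0, "right": 2.0, "up": 0.0}
print(epsilon_greedy(q_values, epsilon=0.1))
print(boltzmann_probabilities(q_values, temperature=1.0))
print(boltzmann_probabilities(q_values, temperature=100.0))
```

A high temperature makes the distribution close to uniform, while a temperature near zero makes it close to greedy selection.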

Evaluation of Reinforcement Learning Algorithms

Off-policy vs On-policy
An off-policy learner learns the value of the optimal policy independently of the agent's actions. Q-learning is an off-policy learner.
An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps. It will find a policy that is optimal, taking into account the exploration inherent in the policy.

On-policy example: SARSA
SARSA experience: $\langle s, a, r, s', a' \rangle$, where $a'$ is the next action chosen by the policy based on Q.
SARSA update rule:
$$Q[s,a] \leftarrow Q[s,a] + \alpha \left(r + \gamma Q[s',a'] - Q[s,a]\right)$$
SARSA is useful when you want to optimize the value of an agent that is exploring. If you want to do offline learning, and then use that policy in an agent that does not explore, Q-learning may be more appropriate.
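The difference between the two update rules is only in the target. The sketch below places them side by side; the dictionary-based table and the function names are assumptions made for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action a_next the agent will actually take."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: the target uses the best action in s_next, whatever is taken."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


actions = ["safe", "risky"]
start = {("s0", "safe"): 0.0, ("s0", "risky"): 0.0,
         ("s1", "safe"): 1.0, ("s1", "risky"): 5.0}

Q_sarsa = dict(start)
# Suppose exploration picks the worse action "safe" in s1.
sarsa_update(Q_sarsa, "s0", "safe", 0.0, "s1", "safe", alpha=0.5, gamma=0.9)

Q_ql = dict(start)
q_learning_update(Q_ql, "s0", "safe", 0.0, "s1", actions, alpha=0.5, gamma=0.9)

print(Q_sarsa[("s0", "safe")], Q_ql[("s0", "safe")])
```

When the next action is an exploratory one, SARSA's estimate reflects its lower value, whereas Q-learning ignores it and backs up the value of the greedy action.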

Assigning Credit and Blame to Paths
In Q-learning and SARSA, only the previous state-action pair has its value revised when a reward is received.
How to update the whole path:
look-ahead or backup: update n steps ahead or back;
model-based methods: the agent learns $P(s,a,s')$ and $R(s,a,s')$ and uses them to update other Q-values.

Reinforcement Learning with Features
Usually, there are too many states to reason about explicitly. The alternative is to reason in terms of features.
Q-function approximations: linear combination of features, decision tree, neural network.
Feature engineering (choosing what features to include) is non-trivial.
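As an illustration of the first option, a linear Q-function approximation replaces the table with a weight vector and applies the temporal difference error to the weights. This sketch is an assumption about how such an update is commonly written (a gradient-style step on a Q-learning target); the slides do not specify the exact rule, and the names used here are hypothetical.

```python
def q_value(weights, features):
    """Q(s,a) approximated as a weighted sum of feature values F_i(s,a)."""
    return sum(w * f for w, f in zip(weights, features))


def linear_q_update(weights, features, r, next_feature_vectors, alpha, gamma):
    """Return new weights after one update.

    features: feature vector of the (s, a) pair that was executed.
    next_feature_vectors: one feature vector per action available in s';
        an empty list is treated as a terminal state with value 0.
    """
    best_next = max(
        (q_value(weights, f) for f in next_feature_vectors), default=0.0
    )
    delta = r + gamma * best_next - q_value(weights, features)
    # Each weight moves in proportion to its feature's contribution.
    return [w + alpha * delta * f for w, f in zip(weights, features)]


weights = [0.0, 0.0]
weights = linear_q_update(weights, [1.0, 2.0], r=1.0,
                          next_feature_vectors=[], alpha=0.1, gamma=0.9)
print(weights)
```

Because the weights are shared, one experience changes the estimates for every state-action pair with overlapping features, which is what allows generalization to states never visited.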

DeepMind
States: last 4 images of 84x84 pixels, 128 gray tones
State-space size: $128^{4 \times 84 \times 84} \approx 10^{59473}$
Features: convolutional neural network
  Input: 4x84x84 image
  3 hidden layers
  Output: Q-value for each action
Actions: 4-18, depending on the game
Rewards: difference in score
  Hardwired; terminal states are also hardwired
Learning: off-policy
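The state-space size quoted above can be verified with a short calculation. The number itself is far too large to be practical to print, so the sketch works with its base-10 logarithm.

```python
import math

gray_tones = 128
pixels = 4 * 84 * 84                 # four stacked 84x84 frames

# log10(128 ** pixels) = pixels * log10(128)
exponent = pixels * math.log10(gray_tones)

print(pixels)                        # number of input values per state
print(math.floor(exponent))          # order of magnitude of the state count
```

A table with one row per state is clearly out of the question at this scale, which is why the Q-function is represented by a neural network instead.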