ARTIFICIAL INTELLIGENCE. Reinforcement learning

INFOB2KI 2018-2019, Utrecht University, The Netherlands. ARTIFICIAL INTELLIGENCE: Reinforcement learning. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

Outline
- Reinforcement learning basics
- Relation with MDPs
- Model-based and model-free learning
- Exploitation vs. exploration
- (Approximate Q-learning)

Reinforcement learning
RL methods are employed to address two related problems: the prediction problem and the control problem.
- Prediction: learn the value function for a (fixed) policy and use it to predict the reward of future actions.
- Control: learn, by interacting with the environment, a policy that maximizes the reward collected while traveling through state space, i.e. obtain an optimal policy which allows for action planning and optimal control.

Examples of Reinforcement Learning
- Robocup soccer teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000
- Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods
- Dynamic channel assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls
- Elevator control (Crites & Barto): (probably) world's best down-peak elevator controller
- Many robots: navigation, bi-pedal walking, grasping, switching between skills, ...
- Games: TD-Gammon, Jellyfish (Tesauro, Dahl), AlphaGo (DeepMind): world's best backgammon and Go players (AlphaGo: https://www.youtube.com/watch?v=subqykxvx0a)

Key Features of RL
- Agent learns by interacting with its environment
- Agent learns from the consequences of its actions (by receiving a reinforcement signal), rather than from being explicitly taught
- Because of chance, the agent has to try things repeatedly
- Agent makes mistakes, even if it learns intelligently (regret)
- Agent selects its actions based on its past experiences (exploitation) and also on new choices (exploration): trial-and-error learning
- Possibly sacrifices short-term gains for larger long-term gains

Reinforcement Learning: idea
[Diagram: the agent sends actions a to the environment; the environment returns a state s and a reward r.]
Basic idea:
- Receive feedback in the form of rewards
- Agent's return in the long run is defined by the reward function
- Must (learn to) act so as to maximize expected return
- All learning is based on observed samples of outcomes!

The Agent-Environment Interface
The agent interacts with the environment at discrete time steps t = 0, 1, 2, ...
- Observes state at step t: s_t ∈ S
- Produces action at step t: a_t ∈ A(s_t)
- Gets resulting reward: r_{t+1} ∈ ℝ
- And resulting next state: s_{t+1}
This yields a trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, ...
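
To make this interface concrete, here is a minimal interaction-loop sketch in Python. The environment object with reset() and step() methods and the random policy are illustrative assumptions, not something prescribed by the lecture.

    import random

    def run_episode(env, policy, max_steps=100):
        """Run one episode: observe s_t, pick a_t, receive r_{t+1} and s_{t+1}."""
        s = env.reset()                      # initial state s_0
        trajectory = []
        for t in range(max_steps):
            a = policy(s)                    # a_t in A(s_t)
            s_next, r, done = env.step(a)    # reward r_{t+1} and next state s_{t+1}
            trajectory.append((s, a, r, s_next))
            s = s_next
            if done:
                break
        return trajectory

    def random_policy(s, actions=("search", "wait")):
        """A trivial policy: pick a random legal action (action names are illustrative)."""
        return random.choice(actions)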

RL as MDP
The best studied case is when RL can be formulated as a (finite) Markov Decision Process (MDP), i.e. we assume:
- a (finite) set of states s ∈ S
- a set of actions (per state) A
- a model T(s,a,s')
- a reward function R(s,a,s')
- the Markov assumption
We are still looking for a policy π(s).
New twist: we don't know T or R! I.e. we don't know which states are good or what the actions do, so we must actually try out actions and states to learn.

An Example: Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if it runs out of power while searching, it has to be rescued (which is bad). Actions are chosen based on the current energy level (states): high, low. Reward = number of cans collected.

Recycling Robot MDP
S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
R^search = expected number of cans while searching
R^wait = expected number of cans while waiting
with R^search > R^wait

MDPs and RL

Known MDP: offline solution (no learning)
  Goal: compute V^π        Technique: policy evaluation
  Goal: compute V*, π*     Technique: value / policy iteration

Unknown MDP: model-based
  Goal: compute V*, π*     Technique: VI/PI on an approximated MDP

Unknown MDP: model-free
  Goal: compute V^π        Technique: direct evaluation, TD-learning
  Goal: compute Q*, π*     Technique: Q-learning

Model-Based Learning
Model-based idea:
- Learn an approximate model based on experiences
- Solve for values, as if the learned model were correct
Step 1: Learn an empirical MDP model
- Count outcomes s' for each (s, a)
- Normalize to give an estimate of T(s,a,s')
- Discover each R(s,a,s') when we experience (s, a, s')
Step 2: Solve the learned MDP
- For example, use value iteration, as before
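
As a sketch of Step 1, the following Python snippet (my illustration, not from the slides) estimates T and R from a list of observed (s, a, r, s') transitions by counting and normalizing.

    from collections import defaultdict

    def estimate_model(transitions):
        """Step 1: estimate T(s,a,s') and R(s,a,s') from experience tuples (s, a, r, s')."""
        counts = defaultdict(int)        # N(s, a, s')
        totals = defaultdict(int)        # N(s, a)
        reward_sum = defaultdict(float)  # accumulated reward observed for (s, a, s')
        for s, a, r, s_next in transitions:
            counts[(s, a, s_next)] += 1
            totals[(s, a)] += 1
            reward_sum[(s, a, s_next)] += r
        T_est = {k: n / totals[(k[0], k[1])] for k, n in counts.items()}  # normalize counts
        R_est = {k: reward_sum[k] / n for k, n in counts.items()}         # average observed reward
        return T_est, R_est

Step 2 would then run value iteration (or policy iteration) on the estimated pair (T_est, R_est) exactly as for a known MDP.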

Model-Free Learning
Model-free idea: directly learn (approximate) state values, based on experiences.
Methods (among others):
I. Direct evaluation
II. Temporal difference learning
III. Q-learning
Passive: use a fixed policy. Active: off-policy.
Remember: this is NOT offline planning! You actually take actions in the world.

I: Direct Evaluation
Goal: compute V(s) under a given policy π
Idea: average the reward-to-go over visits
1. First act according to π for several episodes/epochs
2. Afterwards, for every state s and every time t that s is visited, determine the rewards r_t, r_{t+1}, ... subsequently received in that epoch
3. The sample for s at time t is the sum of discounted future rewards:
   sample = R_t^s = r_t + γ r_{t+1} + γ² r_{t+2} + ...
   given experience tuples <s, π(s), r_t, s'>
4. Average the samples over all visits of s
Note: this is the simplest Monte Carlo method
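
A minimal sketch of direct evaluation in Python (every-visit Monte Carlo averaging), assuming episodes are given as lists of (s, a, r) triples; the representation and helper names are mine, not the lecture's.

    from collections import defaultdict

    def direct_evaluation(episodes, gamma=1.0):
        """Average the discounted reward-to-go over all visits of each state."""
        returns = defaultdict(list)
        for episode in episodes:              # episode = [(s, a, r), ...] generated by policy pi
            G = 0.0
            for s, a, r in reversed(episode): # walk backwards so G is the reward-to-go from each step
                G = r + gamma * G
                returns[s].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}

Applied to the four episodes on the next slide (with γ = 1), this reproduces the output values shown there.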

Example: Direct Evaluation
Input: policy π over states A, B, C, D, E (gridworld figure). Assume γ = 1.
Observed episodes (training):
- Episode 1: B, east, -1, C; C, east, -1, D; D, exit, +10
- Episode 2: B, east, -1, C; C, east, -1, D; D, exit, +10
- Episode 3: E, north, -1, C; C, east, -1, D; D, exit, +10
- Episode 4: E, north, -1, C; C, east, -1, A; A, exit, -10
Output values: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2

Properties of Direct Evaluation
Benefits:
- easy to understand
- doesn't require any knowledge of T, R
- eventually computes the correct average values, using just sample transitions
Drawbacks:
- wastes information about state connections
- each state must be learned separately
- takes a long time to learn
Output values from the example: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2.
If B and E both go to C under this policy, how can their values be different?

II: Temporal Difference Learning
Goal: compute V(s) under a given policy π
Big idea: update after every experience! Likely outcomes will contribute updates more often.
Temporal difference learning of values:
1. Initialize each V(s) with some value
2. Observe experience tuple <s, π(s), r, s'>
3. Use the observation in a rough estimate of the long-term reward of V(s):
   sample_s = r + γ V^π(s')
4. Update V(s) by moving its value slightly towards the estimate:
   V^π(s) ← V^π(s) + α (sample_s − V^π(s))
   where 0 ≤ α ≤ 1 is the learning rate.

Example: TD-Learning
Input: policy π over states A, B, C, D, E (gridworld figure).
Initial values: V(A) = 0, V(B) = 0, V(C) = 0, V(D) = 8, V(E) = 0.
Each V(s) can be initialised with an arbitrary value. The reward function is unknown, but perhaps we do know that we receive a reward of 8 after ending up in D; this can be exploited.
Assume γ = 1, α = 1/2.

Example: TD-Learning (continued)
Experience <s, π(s), r, s'>: B, east, -2, C
sample(B) = -2 + γ·V^π(C) = -2 + 1·0 = -2
Update: V^π(B) = (1−α)·0 + α·sample(B) = -1
Values after the update: V(A) = 0, V(B) = -1, V(C) = 0, V(D) = 8, V(E) = 0
In general:
V^π(s) ← V^π(s) + α (sample_s − V^π(s)) = (1−α) V^π(s) + α·sample_s = (1−α) V^π(s) + α (r + γ V^π(s'))

Example: TD-Learning (continued)
Experience <s, π(s), r, s'>: C, east, -2, D
sample(C) = -2 + γ·V^π(D) = -2 + 1·8 = 6
Update: V^π(C) = (1−α)·0 + α·sample(C) = 3
Values after the update: V(A) = 0, V(B) = -1, V(C) = 3, V(D) = 8, V(E) = 0
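
The two updates above can be reproduced with a short TD(0) sketch in Python (my code, not the lecture's):

    def td_update(V, s, r, s_next, gamma=1.0, alpha=0.5):
        """One TD update: V(s) <- (1 - alpha) * V(s) + alpha * (r + gamma * V(s'))."""
        sample = r + gamma * V[s_next]
        V[s] = (1 - alpha) * V[s] + alpha * sample

    V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}  # initial values from the example
    td_update(V, "B", -2, "C")   # sample(B) = -2 + 1*0 = -2, so V(B) becomes -1
    td_update(V, "C", -2, "D")   # sample(C) = -2 + 1*8 =  6, so V(C) becomes  3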

Properties of TD Value Learning
Benefits:
- Model-free
- Bellman updates: connections between states are used
- Updates upon each action
Drawback:
- Values are learnt per policy: good for policy evaluation, but a long way from establishing an optimal policy
(Note that the same holds for direct evaluation.)

Golf example: how valuable is a state?
- State is the ball location
- Reward of -1 for each stroke until the ball is in the hole
- Actions: putt (use putter), driver (use driver)
- putt succeeds anywhere on the green
Value of a state?

Optimal quantities revisited
A state s has value V(s): V*(s) = expected reward starting in s and acting optimally.
A q-state (s,a) has value Q(s,a): Q*(s,a) = expected reward having taken action a from state s and (thereafter) acting optimally.
The optimal policy: π*(s) = optimal action from state s.

Bellman equation revisited
Recall the Bellman equation for the optimal value function:
V*(s) = max_a Σ_s' T(s,a,s') [R(s,a,s') + γ V*(s')] = max_a Q*(s,a)
Now, since also V*(s') = max_{a'} Q*(s',a'), we have that
Q*(s,a) = Σ_s' T(s,a,s') [R(s,a,s') + γ max_{a'} Q*(s',a')]
The optimal policy now follows directly (no look-ahead) with argmax:
π*(s) = argmax_a Q*(s,a)
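
With a known model, these equations translate directly into code. A small sketch in Python; the dictionary representations of T, R, V and the actions function are my own assumptions.

    def q_from_v(V, T, R, states, actions, gamma=0.9):
        """Q*(s,a) = sum over s' of T(s,a,s') * (R(s,a,s') + gamma * V*(s'))."""
        Q = {}
        for s in states:
            for a in actions(s):
                Q[(s, a)] = sum(T[(s, a, s2)] * (R[(s, a, s2)] + gamma * V[s2])
                                for s2 in states if (s, a, s2) in T)
        return Q

    def greedy_policy(Q, states, actions):
        """pi*(s) = argmax_a Q*(s,a): no one-step look-ahead needed once Q* is known."""
        return {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}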

Gridworld: V and Q values (figure)
Noise = 0.2, discount γ = 0.9, living reward R(s) = 0. Optimal policy?

III: Q-Learning
Idea: do Q-value updates to each q-state (like value iteration). But we can't compute this update without knowing T, R. Instead, incorporate estimates as we go (like TD):
1. Initialize Q(s,a) = 0 for each (s,a) pair
2. Select action a and observe experience <s, a, r, s'>
3. Use the observation in a rough estimate of Q(s,a):
   sample(s,a) = r + γ max_{a'} Q(s',a')
4. Update Q(s,a) by moving its value slightly towards the estimate:
   Q(s,a) ← Q(s,a) + α (sample(s,a) − Q(s,a)) = (1−α) Q(s,a) + α sample(s,a)
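
A minimal tabular Q-learning update in Python, following steps 3 and 4 above; the dictionary representation, parameter defaults, and helper names are my own.

    from collections import defaultdict

    Q = defaultdict(float)   # Q(s, a), initialised to 0 for every (s, a) pair

    def q_learning_update(Q, s, a, r, s_next, actions_next, gamma=0.9, alpha=0.5):
        """Move Q(s,a) a little towards the sample r + gamma * max_a' Q(s', a')."""
        sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions_next)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample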

Optimal Q-Function for Golf
We can hit the ball farther with the driver than with the putter, but with less accuracy.
Q(s, driver) gives the value of using the driver first, then using whichever actions are best.

Updating Q-values: example
Current Q(s,a) values are indicated in the figure; experience <s_1, a_right, 0, s_2>. Assume γ = 0.9, α = 1.
sample(s_1, a_right) = r + γ max_{a'} Q(s_2, a') = 0 + 0.9 · max{63, 81, 100} = 90
Q(s_1, a_right) ← (1−α) Q(s_1, a_right) + α sample(s_1, a_right) = (1−α) · 72 + α · 90 = 90
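
Plugging these numbers into the q_learning_update sketch above reproduces the result; the action labels "up", "left", "right" for s_2 are illustrative, since the slide only gives the values 63, 81, 100.

    Q = {("s1", "right"): 72.0,   # current Q(s1, a_right) from the figure
         ("s2", "up"): 63.0, ("s2", "left"): 81.0, ("s2", "right"): 100.0}
    q_learning_update(Q, "s1", "right", r=0, s_next="s2",
                      actions_next=["up", "left", "right"], gamma=0.9, alpha=1.0)
    print(Q[("s1", "right")])   # 0 + 0.9 * max{63, 81, 100} = 90.0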

Q-Learning Properties I
- Q-learning is off-policy learning
- If rewards ≥ 0, then Q-values are ≥ 0 and non-decreasing with each update
- If each (s,a) pair is visited infinitely often, the process converges to the true (optimal) Q
- Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
- Basically, in the limit, it doesn't matter how you select actions (!)

Q-Learning Properties II
Caveats:
- You have to explore enough
- You have to eventually make the learning rate α small enough, but not decrease it too quickly

Exploration vs. Exploitation
Multi-armed bandit: each machine provides a random reward from a distribution specific to that machine. Which machine should you play, and how many times?

Exploration vs. Exploitation
- The policy indicates the exploration strategy: which action to take in which state
- Standard Q-learning uses the Q-values associated with the best action: pure exploitation, using what it already knows
- We can add randomness for true exploration: sometimes try to learn something new by picking a random action (e.g. ε-greedy), as sketched below
- The exploration-exploitation trade-off is highly influenced by context: online or offline?
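
A minimal sketch of ε-greedy action selection in Python; the parameter value and names are illustrative.

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
        if random.random() < epsilon:
            return random.choice(actions)               # exploration: try something new
        return max(actions, key=lambda a: Q[(s, a)])    # exploitation: use what we already know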

Q-learning to crawl

Approximate Q-Learning

Generalizing Across States
Basic Q-learning keeps a table of all Q-values. In realistic situations, we cannot possibly learn about every single state!
- Too many states to visit them all in training
- Too many states to hold the Q-tables in memory
Instead, we want to generalize:
- Learn about some small number of training states from experience
- Generalize that experience to new, similar situations
This is a fundamental idea in machine learning!

Example: Pacman
Let's say we discover through experience that this state (figure) is bad. In naïve Q-learning, we know nothing about this state (similar figure), or even this one!

Feature-Based Representations
Solution: describe a state using a vector of features (properties).
Features are functions from states to real numbers (often 0/1) that capture important properties of the state.
Example features:
- distance to closest ghost
- distance to closest dot
- number of ghosts
- 1 / (distance to dot)²
- is Pacman in a tunnel? (0/1)
- etc.
- is it the exact state on this slide?
We can also describe a q-state (s, a) with features (e.g. "action moves closer to food").

Linear Value Functions
Using a feature representation, we can write a Q or V function for any state using a few weights:
V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s)
Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + ... + w_n f_n(s,a)
Advantage: our experience is summed up in a few powerful numbers.
Disadvantage: states may share features but actually be very different in value!
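
As a sketch, the linear Q-function is just a weighted sum of feature values; the feature functions below are made-up placeholders, not the Pacman features from the lecture.

    def q_value(weights, features, s, a):
        """Linear Q-function: Q(s,a) = w_1*f_1(s,a) + ... + w_n*f_n(s,a)."""
        return sum(w * f(s, a) for w, f in zip(weights, features))

    # Toy illustration with two made-up features (stand-ins for e.g. distance-based features).
    features = [lambda s, a: 1.0,                                 # bias feature, always on
                lambda s, a: 1.0 if s == "ghost-close" else 0.0]  # 1 if a ghost is flagged as close
    weights = [0.5, -2.0]
    print(q_value(weights, features, "ghost-close", "left"))      # 0.5*1.0 + (-2.0)*1.0 = -1.5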

Approximate Q-Learning
In Q-learning, use the difference between the current Q(s,a) and the new sample to update the weights of the active features:
transition = <s, a, r, s'>
difference = [r + γ max_{a'} Q(s',a')] − Q(s,a)
Before (exact Q-values): Q(s,a) ← Q(s,a) + α · difference
Now (approximate Q with weights w): w_i ← w_i + α · difference · f_i(s,a)
Intuitive interpretation: if something unexpectedly bad happens, blame the features that were on; disprefer all states with that state's features.
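
A sketch of the corresponding weight update, reusing the q_value function from the previous sketch; the learning-rate default and signature are my own choices.

    def approx_q_update(weights, features, s, a, r, s_next, actions_next, gamma=0.9, alpha=0.05):
        """Approximate Q-learning weight update:
        w_i <- w_i + alpha * difference * f_i(s,a),
        where difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)."""
        target = r + gamma * max(q_value(weights, features, s_next, a2) for a2 in actions_next)
        difference = target - q_value(weights, features, s, a)
        return [w + alpha * difference * f(s, a) for w, f in zip(weights, features)]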

Example: Q-Pacman (no noise)

Summary
- Reinforcement learning: learn from experience, not from a teacher
- The reinforcement learning problem can be cast as an MDP with unknown T and R
- Model-based RL: estimate R and T from experience
- Model-free RL: estimate V(s) or Q(s,a) from experience
- The latter can be done actively, using Q-learning, and gives an optimal policy
- Large state spaces: use approximate Q-learning with domain-specific features