Machine Learning: Reinforcement Learning
Jordan Boyd-Graber, University of Maryland
Slides adapted from Tom Mitchell and Pieter Abbeel


Control Learning

Consider learning to choose actions, e.g.:
- a Roomba learning to dock on its battery charger
- learning to choose actions to optimize factory output
- learning to play Backgammon

Note several problem characteristics:
- delayed reward
- opportunity for active exploration
- possibility that the state is only partially observable
- possible need to learn multiple tasks with the same sensors and effectors

One Example: TD-Gammon [Tesauro, 1995]

- learns to play Backgammon
- immediate reward: +100 if win, -100 if lose, 0 for all other states
- trained by playing 1.5 million games against itself
- now approximately equal to the best human player

Reinforcement Learning Problem

At each step t the agent:
- executes action a_t
- receives observation o_t
- receives scalar reward r_t

The environment:
- receives action a_t
- emits observation o_{t+1}
- emits scalar reward r_{t+1}

[Figure: the agent-environment loop. The agent sends actions to the environment; the environment returns states and rewards, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

Markov Decision Processes

Assume:
- a finite set of states S and a set of actions A
- at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
- it then receives immediate reward r_t, and the state changes to s_{t+1}

Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t); i.e., r_t and s_{t+1} depend only on the current state and action.
- the functions δ and r may be nondeterministic
- the functions δ and r are not necessarily known to the agent
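To make the formalism concrete, here is a minimal sketch of a deterministic MDP specified by δ and r (Python; the two-corridor example, its states, actions, and reward values are illustrative inventions, not from the slides):

    # A tiny deterministic MDP: transition function delta and reward function r
    # stored as dictionaries keyed by (state, action). All values are made up.
    STATES = ["s0", "s1", "G"]            # G is an absorbing goal state
    ACTIONS = ["left", "right"]

    delta = {                              # delta(s, a) -> next state
        ("s0", "right"): "s1", ("s0", "left"): "s0",
        ("s1", "right"): "G",  ("s1", "left"): "s0",
        ("G", "right"): "G",   ("G", "left"): "G",
    }

    r = {("s1", "right"): 100.0}           # r(s, a); unlisted pairs pay 0

    def step(s, a):
        """One step of the (known, deterministic) environment."""
        return delta[(s, a)], r.get((s, a), 0.0)

    print(step("s1", "right"))             # ('G', 100.0)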

State

Experience is a sequence of observations, actions, and rewards:

    o_1, r_1, a_1, ..., a_{t-1}, o_t, r_t    (1)

The state is a summary of experience:

    s_t = f(o_1, r_1, a_1, ..., a_{t-1}, o_t, r_t)    (2)

In a fully observed environment:

    s_t = f(o_t)    (3)

Agent's Learning Task

Execute actions in the environment, observe the results, and learn an action policy π : S → A that maximizes

    r_t + γ r_{t+1} + γ² r_{t+2} + ...

from any starting state in S. Here 0 ≤ γ < 1 is the discount factor for future rewards.

Note something new: the target function is π : S → A, but we have no training examples of the form ⟨s, a⟩; training examples are of the form ⟨⟨s, a⟩, r⟩.
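As a quick sketch of this objective, the discounted cumulative reward can be computed directly (Python; the reward sequence below is illustrative):

    def discounted_return(rewards, gamma=0.9):
        """sum_i gamma^i * r_{t+i} for a finite reward sequence."""
        return sum((gamma ** i) * ri for i, ri in enumerate(rewards))

    # A reward of 100 arriving three steps in the future is worth 0.9^3 * 100 = 72.9 now.
    print(discounted_return([0, 0, 0, 100]))   # ~72.9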

What makes an RL agent?

- Policy: the agent's behavior function
- Value function: how good each state and/or action is
- Model: the agent's representation of the environment

Policy

A policy is the agent's behavior. It is a map from state to action:
- deterministic policy: a = π(s)
- stochastic policy: π(a | s) = p(a | s)

Value Function

To begin, consider deterministic worlds. For each possible policy π the agent might adopt, we can define an evaluation function over states:

    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ... ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
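Under the deterministic-world assumption, V^π(s) can be approximated by simply rolling the policy forward and summing discounted rewards; a self-contained sketch (Python; the toy environment, its states, and its rewards are invented for illustration):

    def evaluate_policy(policy, s, step_fn, gamma=0.9, horizon=100):
        """Approximate V^pi(s) by following pi for `horizon` steps and
        summing discounted rewards (the infinite tail is truncated)."""
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, reward = step_fn(s, a)
            total += discount * reward
            discount *= gamma
        return total

    # Toy deterministic world: walking "right" from s0 reaches goal G in two
    # steps; entering G pays 100, everything else pays 0, and G is absorbing.
    def toy_step(s, a):
        transitions = {("s0", "right"): ("s1", 0.0),
                       ("s1", "right"): ("G", 100.0),
                       ("G",  "right"): ("G", 0.0)}
        return transitions[(s, a)]

    print(evaluate_policy(lambda s: "right", "s0", toy_step))   # 0 + 0.9 * 100 = 90.0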

Q-Learning

Restated, the task is to learn the optimal policy π*:

    π* ≡ argmax_π V^π(s), for all s

[Figure: a grid world with absorbing goal state G, shown three ways: the immediate rewards r(s,a) (100 for actions entering G, 0 otherwise), the corresponding Q(s,a) values (100, 90, 81, 72, ... with γ = 0.9), and one optimal policy.]

What to Learn

We might try to have the agent learn the evaluation function V^{π*} (which we write as V*). It could then do a one-step lookahead search to choose the best action from any state s, because

    π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

A problem: this works well if the agent knows δ : S × A → S and r : S × A → ℝ. But when it doesn't, it can't choose actions this way.
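A sketch of that one-step lookahead (Python; `delta`, `r`, and `V` stand for a known transition table, a known reward table, and a learned value table, all hypothetical here):

    def greedy_from_V(s, actions, delta, r, V, gamma=0.9):
        """pi*(s) = argmax_a [ r(s,a) + gamma * V*(delta(s,a)) ].
        Only works when the model (delta and r) is known."""
        return max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])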

Q Function

Define a new function very similar to V*:

    Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))

If the agent learns Q, it can choose the optimal action even without knowing δ:

    π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ] = argmax_a Q(s,a)

Q is the evaluation function the agent will learn.
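By contrast, acting greedily with respect to a learned Q table needs neither δ nor r (a sketch; the dict-based table layout is an assumption):

    def greedy_from_Q(s, actions, Q):
        """pi*(s) = argmax_a Q(s,a); no model of delta or r required."""
        return max(actions, key=lambda a: Q[(s, a)])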

Training Rule to Learn Q

Note that Q and V* are closely related:

    V*(s) = max_a Q(s,a)

which allows us to write Q recursively as

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t)) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote the learner's current approximation to Q, and consider the training rule

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s.
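A minimal sketch of that rule as a single table update (Python; representing Q̂ as a dict from (state, action) pairs to values is my own choice):

    def q_update(Q, s, a, reward, s_next, actions, gamma=0.9):
        """Deterministic-world rule: Q_hat(s,a) <- r + gamma * max_a' Q_hat(s',a')."""
        Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)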

Value Function

A value function is a prediction of future reward: how much reward will I get from action a in state s?

The Q-value function gives the expected total reward from state s and action a under policy π, with discount factor γ (future rewards count for less than immediate ones):

    Q^π(s,a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a ]    (4)

A Value Function is Great!

An optimal value function is the maximum achievable value:

    Q*(s,a) = max_π Q^π(s,a) = Q^{π*}(s,a)    (5)

If you know the optimal value function, you can derive the optimal policy:

    π* = argmax_a Q*(s,a)    (6)

Q-Learning for Deterministic Worlds

For each s, a, initialize the table entry Q̂(s,a) ← 0.
Observe the current state s.
Do forever:
- select an action a and execute it
- receive immediate reward r
- observe the new state s'
- update the table entry for Q̂(s,a): Q̂(s,a) ← r + γ max_{a'} Q̂(s', a')
- s ← s'
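Putting those steps together, a sketch of the full tabular loop (Python; the `env.reset()` / `env.step(a)` interface and the ε-greedy action selection are assumptions of mine, since the slide does not say how actions are selected):

    import random
    from collections import defaultdict

    def q_learning_deterministic(env, actions, gamma=0.9, epsilon=0.1, episodes=1000):
        """Tabular Q-learning for a deterministic world. `env` is assumed to expose
        reset() -> state and step(a) -> (next_state, reward, done)."""
        Q = defaultdict(float)                      # Q_hat(s, a), initialized to 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:       # explore occasionally
                    a = random.choice(actions)
                else:                               # otherwise act greedily on Q_hat
                    a = max(actions, key=lambda act: Q[(s, act)])
                s_next, reward, done = env.step(a)  # execute a, receive r, observe s'
                # Q_hat(s,a) <- r + gamma * max_a' Q_hat(s',a')
                Q[(s, a)] = reward + gamma * max(Q[(s_next, act)] for act in actions)
                s = s_next                          # s <- s'
        return Q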

Updating Q̂

[Figure: one step in the grid world. The agent moves right (a_right) from initial state s_1 to next state s_2; the current Q̂ values for the actions out of s_2 are 63, 81, and 100, the move earns reward r = 0, and the old estimate Q̂(s_1, a_right) = 72 is replaced by 90.]

    Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                     = 0 + 0.9 max{63, 81, 100} = 90

If rewards are non-negative, then

    Q̂_{n+1}(s,a) ≥ Q̂_n(s,a) for all s, a, n    and    0 ≤ Q̂_n(s,a) ≤ Q(s,a) for all s, a, n,

and Q̂ converges to Q.
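The arithmetic of that single update, as a quick check (Python; the values are taken from the figure):

    gamma = 0.9
    r = 0                              # a non-goal move, so immediate reward is 0
    q_s2 = [63, 81, 100]               # current Q_hat estimates for actions from s_2
    print(r + gamma * max(q_s2))       # 90.0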

Nondeterministic Case

What if the reward and next state are non-deterministic? We redefine V and Q by taking expected values:

    V^π(s) ≡ E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ] = E[ Σ_{i=0}^∞ γ^i r_{t+i} ]

    Q(s,a) ≡ E[ r(s,a) + γ V*(δ(s,a)) ]

Nondeterministic Case (continued)

Q-learning generalizes to nondeterministic worlds. Alter the training rule to

    Q̂_n(s,a) ← (1 - α_n) Q̂_{n-1}(s,a) + α_n [ r + γ max_{a'} Q̂_{n-1}(s', a') ]

where

    α_n = 1 / (1 + visits_n(s,a))

We can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992].
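A sketch of that decaying-learning-rate update (Python; the dict-based tables and the helper name are mine):

    from collections import defaultdict

    def q_update_nondeterministic(Q, visits, s, a, reward, s_next, actions, gamma=0.9):
        """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma max_a' Q_{n-1}(s',a')],
        with alpha_n = 1 / (1 + visits_n(s,a))."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target

    Q, visits = defaultdict(float), defaultdict(int)   # tables keyed by (state, action)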

Temporal Difference Learning

Q-learning reduces the discrepancy between successive Q estimates using a one-step time difference:

    Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

    Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

    Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

    Q^λ(s_t, a_t) ≡ (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ... ]
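A sketch of the n-step targets and their λ-blend over a stored finite trajectory (Python; truncating the blend at the data available is my simplification):

    def n_step_target(rewards, bootstrap, gamma, n):
        """Q^(n) = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n * bootstrap,
        where bootstrap = max_a Q_hat(s_{t+n}, a)."""
        return sum((gamma ** i) * r for i, r in enumerate(rewards[:n])) + (gamma ** n) * bootstrap

    def lambda_blend(rewards, bootstraps, gamma, lam):
        """Q^lambda = (1 - lambda) * sum_{n>=1} lambda^(n-1) Q^(n), truncated at len(rewards).
        bootstraps[n-1] holds max_a Q_hat(s_{t+n}, a)."""
        total = sum((lam ** (n - 1)) * n_step_target(rewards, bootstraps[n - 1], gamma, n)
                    for n in range(1, len(rewards) + 1))
        return (1.0 - lam) * total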

Temporal Difference Learning (continued)

    Q^λ(s_t, a_t) ≡ (1 - λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ... ]

Equivalent expression:

    Q^λ(s_t, a_t) = r_t + γ [ (1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

The TD(λ) algorithm uses the above training rule.
- It sometimes converges faster than Q-learning.
- It converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992).
- Tesauro's TD-Gammon uses this algorithm.


What if the number of states is huge and/or structured?

Let's say we discover that a state is bad. In Q-learning as described so far, we know nothing about similar states.

Solution: a feature-based representation, e.g. for Pacman:
- distance to the closest ghost
- distance to the closest dot
- number of ghosts
- is Pacman in a tunnel?
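One common way to act on this idea (not spelled out on the slide) is linear approximate Q-learning: represent Q̂(s,a) as a weighted sum of such features and update the weights from the TD error. A sketch (Python; the feature values and learning rate are hypothetical):

    def linear_q(weights, features):
        """Q_hat(s, a) = w . f(s, a): similar feature vectors get similar values."""
        return sum(w * f for w, f in zip(weights, features))

    def approx_q_update(weights, features, reward, best_next_q, gamma=0.9, lr=0.01):
        """w_i <- w_i + lr * [r + gamma * max_a' Q_hat(s',a') - Q_hat(s,a)] * f_i(s,a)."""
        td_error = reward + gamma * best_next_q - linear_q(weights, features)
        return [w + lr * td_error * f for w, f in zip(weights, features)]

    # Hypothetical Pacman-style features: [distance to closest ghost,
    # distance to closest dot, number of ghosts, in-tunnel indicator]
    weights = approx_q_update([0.0, 0.0, 0.0, 0.0], [3.0, 1.0, 2.0, 0.0],
                              reward=10.0, best_next_q=0.0)
    print(weights)                       # ~[0.3, 0.1, 0.2, 0.0]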