Reinforcement Learning

CS7/CS7, Fall 2005

Supervised Learning vs. Reinforcement Learning
Supervised learning: training examples (x, y), i.e. direct feedback y for each input x. Reinforcement learning: a sequence of decisions with only eventual feedback, and no teacher that critiques individual actions; the agent must learn to act and to assign blame/credit to individual actions. Examples: when playing a game, the final result (win, loss, or draw) arrives only after many actions; a robot fetching bagels from a bakery; navigating the Web to collect all CS pages; control problems (e.g. reactor control).

Issues
- Does the agent know the full environment a priori, or is the environment unknown?
- Is the agent passive (it only watches) or active (it explores)?
- Is feedback (i.e. reward) given only in terminal states, or is there a bit of feedback in every state?
- How do we measure and estimate the utility of each action?
- Is the environment fully observable or only partially observable?
- Does the agent have a model of the environment and of the effects of its actions?
Reinforcement learning will address these issues!

Markov Decision Process
Representation of the environment: a finite set of states S, and a set of actions A for each state s ∈ S. Process: at each discrete time step, the agent observes state s_t ∈ S and then chooses action a_t ∈ A. After that, the environment gives the agent an immediate reward r_t and changes state to s_{t+1} (possibly probabilistically).

Markov Decision Process Model
- Initial state: s_0
- Transition function: T(s, a, s'), the probability of moving from state s to s' when executing action a.
- Reward function: R(s), the real-valued reward the agent receives for entering state s.
Assumption (Markov property): T(s, a, s') and R(s) depend only on the current state s, not on any states visited earlier. Extension: the function R may be non-deterministic as well.

Example: the 4x3 grid world with a START state, one terminal state with reward +1 and one with reward -1; each other state has a reward of -0.04. The motion model is noisy: the agent moves in the desired direction with probability 0.8 (80%), and 90 degrees to the left or to the right with probability 0.1 (10%) each.
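As a concrete reference for the algorithms that follow, here is a minimal Python sketch of this grid world (not the slides' code). It assumes the standard textbook 4x3 layout: states are (column, row) pairs, there is a wall at (2, 2), the +1 terminal is at (4, 3) and the -1 terminal is at (4, 2); these coordinates, and the helper structure, are assumptions made for illustration.

```python
# A minimal sketch of the 4x3 grid world above (assumed textbook layout:
# (column, row) states, wall at (2, 2), +1 terminal at (4, 3), -1 at (4, 2)).

STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
START = (1, 1)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def R(s):
    """Reward for entering state s: +1 / -1 at the terminals, -0.04 elsewhere."""
    return TERMINALS.get(s, -0.04)

def _move(s, delta):
    """Deterministic move; the agent stays put if it would hit the wall or leave the grid."""
    target = (s[0] + delta[0], s[1] + delta[1])
    return target if target in STATES else s

def T(s, a, s_next):
    """T(s, a, s'): probability of reaching s_next from s under action a with the
    noisy motion model (0.8 intended direction, 0.1 to each side)."""
    if s in TERMINALS:                          # no transitions out of terminal states
        return 0.0
    dx, dy = ACTIONS[a]
    outcomes = [(_move(s, (dx, dy)), 0.8),      # intended direction
                (_move(s, (-dy, dx)), 0.1),     # slip 90 degrees to the left
                (_move(s, (dy, -dx)), 0.1)]     # slip 90 degrees to the right
    return sum(p for nxt, p in outcomes if nxt == s_next)
```

The later sketches take STATES, ACTIONS, T and R as parameters, so any MDP in this shape can be plugged in.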

Policy
Definition: a policy π describes which action the agent selects in each state: a = π(s).

Optimal Policies for Other Rewards
(figure: optimal policies of the 4x3 world for other per-state reward values)

Utility
For now: U([s_0, ..., s_N]) = Σ_i R(s_i). Let P([s_0, ..., s_N] | π, s_0) be the probability of the state sequence [s_0, ..., s_N] when following policy π from state s_0. Expected utility: U^π(s) = Σ U([s_0, ..., s_N]) P([s_0, ..., s_N] | π, s_0), a measure of the quality of policy π. Optimal policy π*: the policy with maximal U^π(s) in each state s.

Utility (revisited)
What happens to the utility when the state space has no terminal states, or the policy never directs the agent to a terminal state? The utility becomes infinite. Solution: a discount factor 0 < γ < 1, so that closer rewards count more than rewards far in the future: U([s_0, ..., s_N]) = Σ_i γ^i R(s_i). This gives a finite utility even for infinite state sequences.

How to Compute the Utility for a Given Policy?
Definition: U^π(s) = Σ [ Σ_i γ^i R(s_i) ] P([s_0, s_1, ...] | π, s_0 = s).
Recursive computation (Bellman update for a fixed π): U^π(s) = R(s) + γ Σ_s' T(s, π(s), s') U^π(s').
(figure: resulting utilities of all states in the 4x3 world; here γ = 1.0 and R(s) = -0.04 for non-terminal states)

How to Find the Optimal Policy π*?
Goal: solve the set of n = |S| equations (one for each state)
U^π(s_1) = R(s_1) + γ Σ_s' T(s_1, π(s_1), s') U^π(s')
...
U^π(s_n) = R(s_n) + γ Σ_s' T(s_n, π(s_n), s') U^π(s')

Algorithm [Policy Evaluation]:
  i = 0; U^π_0(s) = 0 for all s
  repeat
    i = i + 1
    U^π_i(s) = R(s) + γ Σ_s' T(s, π(s), s') U^π_{i-1}(s')
  until the difference between U^π_i and U^π_{i-1} is small enough
  return U^π_i

Is the policy π optimal? How can we tell? If π is not optimal, then there exists some state where π(s) ≠ argmax_a Σ_s' T(s, a, s') U^π(s').
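A sketch of the Policy Evaluation loop above, parameterized by the MDP components from the grid-world sketch. Here pi is a dict mapping every state to an action (for terminal states the action is irrelevant, since T is zero there); the max_iters cap is an added safeguard, not part of the slides' algorithm.

```python
# Iterative Bellman updates for a fixed policy pi, in the style of the
# Policy Evaluation algorithm above (max_iters is an added safeguard).

def policy_evaluation(pi, STATES, T, R, gamma=1.0, eps=1e-6, max_iters=10_000):
    U = {s: 0.0 for s in STATES}                      # U_0(s) = 0 for all s
    for _ in range(max_iters):
        # one Bellman update for the fixed policy pi
        U_new = {s: R(s) + gamma * sum(T(s, pi[s], s2) * U[s2] for s2 in STATES)
                 for s in STATES}
        converged = max(abs(U_new[s] - U[s]) for s in STATES) < eps
        U = U_new
        if converged:                                 # difference small enough
            break
    return U

# Example: evaluate the policy that always moves 'up' in the grid world.
# U_up = policy_evaluation({s: 'up' for s in STATES}, STATES, T, R, gamma=1.0)
```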

How to Find the Optimal Policy π*?
Algorithm [Policy Iteration]:
  repeat
    U^π = PolicyEvaluation(π, S, T, R)
    for each state s: if max_a Σ_s' T(s, a, s') U^π(s') > Σ_s' T(s, π(s), s') U^π(s')
      then π(s) = argmax_a Σ_s' T(s, a, s') U^π(s')
  until π does not change any more
  return π

Utility-Policy Equivalence
If we know the optimal utility U(s) of each state, we can derive the optimal policy: π*(s) = argmax_a Σ_s' T(s, a, s') U(s'). If we know the optimal policy π*, we can compute the optimal utility of each state with the PolicyEvaluation algorithm.
Bellman equation: U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s'). This is a necessary and sufficient condition for the optimal U(s).

Value Iteration Algorithm
Algorithm [Value Iteration]:
  i = 0; U_0(s) = 0 for all s
  repeat
    i = i + 1
    U_i(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_{i-1}(s')
  until the difference between U_i and U_{i-1} is small enough
  return U_i
Then derive the optimal policy via π*(s) = argmax_a Σ_s' T(s, a, s') U(s').

Convergence of Value Iteration
(figure: utility estimates of several grid states plotted against the number of value-iteration iterations)
Value iteration is guaranteed to converge to the optimal U for 0 ≤ γ < 1; convergence is faster for smaller γ.

Reinforcement Learning
Assumptions we made so far: a known state space S, a known transition model T(s, a, s'), and a known reward function R(s). This is not realistic for many real agents. Reinforcement learning: learn the optimal policy in an a priori unknown environment. We assume a fully observable environment (i.e. the agent can tell its state); the agent needs to explore the environment (i.e. experimentation).

Passive Reinforcement Learning
Task: given a policy π, what is its utility function U^π? This is similar to Policy Evaluation, but T(s, a, s') and R(s) are unknown. Approach: the agent experiments in the environment. Trials: execute the policy from the start state until a terminal state is reached.
(example trials from the 4x3 world: each trial is a sequence of visited states, each with reward -0.04, ending in a terminal state with reward +1 or -1)
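A hedged sketch of the Policy Iteration loop described above, reusing the policy_evaluation function from the previous sketch; expected_utility is a small helper introduced here for readability, and γ < 1 is used by default so the evaluation step stays well behaved even for an arbitrary initial policy.

```python
# Policy Iteration in the style of the algorithm above, reusing
# policy_evaluation and the grid-world components (STATES, ACTIONS, T, R).

import random

def expected_utility(a, s, U, STATES, T):
    """Sum over s' of T(s, a, s') * U(s')."""
    return sum(T(s, a, s2) * U[s2] for s2 in STATES)

def policy_iteration(STATES, ACTIONS, T, R, gamma=0.95):
    pi = {s: random.choice(list(ACTIONS)) for s in STATES}   # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, STATES, T, R, gamma)       # evaluate current policy
        changed = False
        for s in STATES:
            best_a = max(ACTIONS, key=lambda a: expected_utility(a, s, U, STATES, T))
            # switch only if strictly better than the current action
            if (expected_utility(best_a, s, U, STATES, T)
                    > expected_utility(pi[s], s, U, STATES, T)):
                pi[s] = best_a
                changed = True
        if not changed:
            return pi                                        # pi no longer changes
```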

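And a corresponding sketch of Value Iteration, again assuming grid-world style STATES, ACTIONS, T, R. As the slides note, convergence to the optimal U is guaranteed for 0 ≤ γ < 1; the max_iters cap is only an added safeguard.

```python
# Value Iteration in the style of the algorithm above.

def value_iteration(STATES, ACTIONS, T, R, gamma=0.95, eps=1e-6, max_iters=10_000):
    U = {s: 0.0 for s in STATES}                             # U_0(s) = 0 for all s
    for _ in range(max_iters):
        U_new = {s: R(s) + gamma * max(sum(T(s, a, s2) * U[s2] for s2 in STATES)
                                       for a in ACTIONS)
                 for s in STATES}
        converged = max(abs(U_new[s] - U[s]) for s in STATES) < eps
        U = U_new
        if converged:
            break
    # derive the optimal policy from the optimal utilities
    pi = {s: max(ACTIONS, key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in STATES))
          for s in STATES}
    return U, pi
```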
Direct Utility Estimation
Data: trials of the form above. For each state independently, average the total reward observed from that state to the end of the trial, over all trials; this turns utility estimation into a supervised learning problem. Why is this less efficient than necessary? It ignores the dependencies between states: U^π(s) = R(s) + γ Σ_s' T(s, π(s), s') U^π(s').

Adaptive Dynamic Programming (ADP)
Run trials to learn a model of the environment (i.e. T and R): memorize R(s) for all visited states, and estimate T(s, a, s') as the fraction of times action a taken in state s leads to s'. Then use the PolicyEvaluation algorithm on the estimated model. Problem? This can be quite costly for large state spaces. For example, Backgammon has an enormous number of states (on the order of 10^20); we would have to learn and store all transition probabilities and rewards, and PolicyEvaluation would need to solve a linear system with that many equations and variables.

Temporal Difference (TD) Learning
Do not learn an explicit model of the environment! Use an update rule that implicitly reflects the transition probabilities. Method: initialize U^π(s) with R(s) when s is first visited; after each observed transition from s to s', update
U^π(s) ← U^π(s) + α [R(s) + γ U^π(s') - U^π(s)]
where α is the learning rate. α should decrease slowly over time, so that the estimates eventually stabilize. Properties: no need to store a model, and only one update per action (not a full PolicyEvaluation).

Active Reinforcement Learning
Task: in an a priori unknown environment (unknown T(s, a, s') and R(s)), find the optimal policy. The agent must experiment with the environment. Naïve approach (naïve active policy iteration): start with some random policy; follow the policy to learn a model of the environment and use ADP to estimate utilities; update the policy using π(s) = argmax_a Σ_s' T(s, a, s') U^π(s'). This can converge to a sub-optimal policy! By following its policy, the agent might never learn T and R everywhere: there is a need for exploration.

Exploration vs. Exploitation
Exploration: take actions that explore the environment. Hope: possibly find areas of the state space with higher reward. Problem: possibly take suboptimal steps. Exploitation: follow the current policy; guaranteed to obtain a certain expected reward. Approaches: sometimes take random steps, or give a bonus reward to states that have not been visited often yet.

Q-Learning
To select actions via argmax_a Σ_s' T(s, a, s') U^π(s'), the agent needs a model of the environment. Solution: learn the action utility function Q(a, s) instead of the state utility function U(s), defined so that U(s) = max_a Q(a, s).
Bellman equation with Q(a, s) instead of U(s): Q(a, s) = R(s) + γ Σ_s' T(s, a, s') max_a' Q(a', s')
TD update with Q(a, s) instead of U(s): Q(a, s) ← Q(a, s) + α [R(s) + γ max_a' Q(a', s') - Q(a, s)]
Result: with the Q-function, the agent can select actions without a model of the environment: argmax_a Q(a, s).
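A small sketch of passive TD(0) learning as described above. Since the model is unknown, it assumes a hypothetical simulate_step(s, a) function that samples the environment's next state; start and terminals describe the task, and α is kept constant for simplicity even though the slides recommend decreasing it slowly over time.

```python
# Passive TD learning of U^pi without a model. simulate_step(s, a) is a
# hypothetical environment interface (not from the slides) returning the
# observed next state.

def td_policy_evaluation(pi, start, terminals, R, simulate_step,
                         gamma=1.0, alpha=0.05, n_trials=1000):
    U = {}
    for _ in range(n_trials):                         # one trial = one run to a terminal
        s = start
        U.setdefault(s, R(s))                         # init U(s) with R(s) on first visit
        while s not in terminals:
            s_next = simulate_step(s, pi[s])          # observed transition, model-free
            U.setdefault(s_next, R(s_next))
            # TD update: move U(s) toward the sampled target R(s) + gamma * U(s')
            U[s] += alpha * (R(s) + gamma * U[s_next] - U[s])
            s = s_next
    return U
```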

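A sketch of tabular Q-learning with ε-greedy exploration, combining the Q update above with the "sometimes take random steps" idea from the exploration slide. The simulate_step(s, a) interface is again a hypothetical assumption; valuing terminal states by their reward is an added convention consistent with U(terminal) = R(terminal).

```python
# Tabular Q-learning with epsilon-greedy exploration; simulate_step(s, a)
# is a hypothetical environment interface.

import random

def q_learning(start, terminals, ACTIONS, R, simulate_step,
               gamma=1.0, alpha=0.1, epsilon=0.1, n_trials=5000):
    Q = {}                                                    # Q[(a, s)], default 0.0

    def q(a, s):
        return Q.get((a, s), 0.0)

    def best_value(s):
        # terminal states are valued by their reward (added convention)
        return R(s) if s in terminals else max(q(a, s) for a in ACTIONS)

    for _ in range(n_trials):
        s = start
        while s not in terminals:
            # exploration vs. exploitation
            if random.random() < epsilon:
                a = random.choice(list(ACTIONS))              # explore
            else:
                a = max(ACTIONS, key=lambda act: q(act, s))   # exploit current Q
            s_next = simulate_step(s, a)
            # TD update on Q(a, s) toward R(s) + gamma * max_a' Q(a', s')
            Q[(a, s)] = q(a, s) + alpha * (R(s) + gamma * best_value(s_next) - q(a, s))
            s = s_next

    # the learned greedy policy needs no model of the environment
    policy = {s: max(ACTIONS, key=lambda act: q(act, s)) for (_, s) in Q}
    return Q, policy
```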
Q-Learning Illustration
(figure: in the grid world, each state stores one Q-value per action, Q(up, s), Q(right, s), Q(down, s), Q(left, s))

Function Approximation
Storing Q, or U, T, and R, in a table with one entry per state is too expensive when the number of states is large, and it does not exploit the similarity of states (i.e. the agent has to learn a separate behavior for each state, even if the states are similar). Solution: approximate the function using a parametric representation. For example, Φ(s) is a feature vector describing the state: the material values of the board, whether the queen is threatened, and so on.
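An illustrative sketch of the parametric idea: Q(a, s) is approximated as a dot product of a weight vector with a feature vector Φ(s, a), and the tabular TD update becomes a gradient-style update of the weights. The feature function below is a toy placeholder; real features (such as the slides' board material or "is the queen threatened?") would replace it.

```python
# Linear function approximation for Q: Q(a, s) ~ w . phi(s, a). The feature
# function phi below is a toy placeholder, not the slides' features.

import numpy as np

GRID_ACTIONS = ('up', 'down', 'left', 'right')

def phi(s, a):
    """Hypothetical feature vector: a bias, the grid coordinates of s, and a
    one-hot encoding of the action."""
    onehot = [1.0 if a == a2 else 0.0 for a2 in GRID_ACTIONS]
    return np.array([1.0, float(s[0]), float(s[1])] + onehot)

def q_approx(w, s, a):
    return float(np.dot(w, phi(s, a)))

def td_weight_update(w, s, a, r, s_next, gamma=0.95, alpha=0.01):
    """One TD update of the weights after observing reward r and next state
    s_next (cf. the tabular Q update above)."""
    target = r + gamma * max(q_approx(w, s_next, a2) for a2 in GRID_ACTIONS)
    return w + alpha * (target - q_approx(w, s, a)) * phi(s, a)

# Usage: start with w = np.zeros(len(phi((1, 1), 'up'))) and apply
# td_weight_update after every observed transition.
```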