Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan


COS 402 Machine Learning and Artificial Intelligence, Fall 2016. Lecture 18: Reinforcement Learning. Sanjeev Arora, Elad Hazan. Some slides borrowed from Peter Bodik and David Silver.

Course progress. Learning from examples: definition + fundamental theorem of statistical learning, which motivated efficient algorithms/optimization; convexity, greedy optimization, gradient descent; neural networks. Knowledge representation: NLP, logic, Bayes nets. Optimization: MCMC. HMMs. Today: reinforcement learning, part 1.

Admin: the (programming) exercise on MCMC is announced today; due in 1 week, in class, as usual.

Decisions and planning. Thus far: learning from examples; knowledge representation / language; inference/prediction. Missing: actions/decisions, learning from interaction. RL: no supervisor, only a reward signal; feedback is delayed; time really matters (sequential, non-i.i.d. data); the agent's actions affect the subsequent data it receives.

RL examples: fly stunt maneuvers in a helicopter; defeat the world champion at Backgammon (and Go); control a power station; make a humanoid robot walk; play Atari games better than humans.

Reward hypothesis. Agent goal: maximize cumulative reward. Hypothesis: all goals can be described by the maximization of expected cumulative reward (?). Examples: fly stunt maneuvers in a helicopter: +ve reward for following the desired trajectory, -ve reward for crashing. Backgammon: +/-ve reward for winning/losing a game. Make a humanoid robot walk: +ve reward for forward motion, -ve reward for falling over. Play many different Atari games: +/-ve reward for increasing/decreasing the score.

Sequential decision making. The agent takes an action; nature responds with a reward; the agent sees an observation. The agent has an internal state built from all previous observations: s_t = f(H_t), where H_t = {o_1, r_1, a_1, ..., o_{t-1}, r_{t-1}, a_{t-1}, o_t, r_t}. Markovian assumption: state, observation, and reward are independent of the past given the current state: Pr[s_t | s_{t-1}] = Pr[s_t | s_1, ..., s_{t-1}].

Markovian? State? Actions? Rewards?

Robot in a room (grid world with +1 and -1 terminal cells and a START cell). Actions: UP, DOWN, LEFT, RIGHT. Taking UP: 80% move UP, 10% move LEFT, 10% move RIGHT. Reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step. States, actions, rewards. What is the solution?
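To make the stochastic action model concrete, here is a minimal sketch (my own, not from the slides) of the noisy move dynamics, assuming a standard 4x3 grid with columns 1-4 and rows 1-3 and ignoring any interior walls shown in the slide's picture; moves that would leave the grid keep the robot in place.

```python
# Hypothetical illustration of the slide's action model: the intended move
# succeeds with prob. 0.8 and slips to each perpendicular direction with
# prob. 0.1; off-grid moves leave the robot where it is.
COLS, ROWS = 4, 3   # assumed 4x3 grid, cells addressed as [column, row]

def next_cell(cell, move):
    """Apply a deterministic move (dx, dy); stay put if it leaves the grid."""
    x, y = cell[0] + move[0], cell[1] + move[1]
    return (x, y) if 1 <= x <= COLS and 1 <= y <= ROWS else cell

def action_distribution(cell, action):
    """Distribution over next cells for a noisy action: 80% intended, 10% per side."""
    moves = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
    slips = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
             "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
    dist = {}
    for a, p in [(action, 0.8), (slips[action][0], 0.1), (slips[action][1], 0.1)]:
        nxt = next_cell(cell, moves[a])
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

print(action_distribution((1, 1), "UP"))   # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```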

Is this a solution? Only if actions are deterministic; not in this case (actions are stochastic). A solution/policy is a mapping from each state to an action.

Optimal policy

Reward for each step: -2

Reward for each step: -

Reward for each step: -0.04

Reward for each step: -0.01

Reward for each step: +0.01

Formal model: Markov Decision Process. States: Markov Process (chain). Rewards: Markov Reward Process. Decisions: Markov Decision Process.

Markov Process: the student chain, with states Facebook (FB), Class (C), Party (P), and Sleep (S); the transition probabilities (e.g. 0.9, 0.4, 1) are shown in the diagram on the slide.

Example: the student chain. Example episodes (random walks): C C Fb Fb C P S; C Fb Fb Fb Fb C P S.
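As an illustration of such random walks, here is a minimal sketch (my own, not the lecture's code) that samples episodes from a small four-state chain; the transition probabilities below are illustrative placeholders, not necessarily the numbers in the slide's diagram.

```python
import numpy as np

states = ["C", "Fb", "P", "S"]           # Class, Facebook, Party, Sleep
P = np.array([                            # P[i, j] = Pr[next = j | current = i]
    [0.2, 0.4, 0.2, 0.2],                 # from Class
    [0.1, 0.9, 0.0, 0.0],                 # from Facebook (0.9 self-loop)
    [0.4, 0.0, 0.4, 0.2],                 # from Party
    [0.0, 0.0, 0.0, 1.0],                 # Sleep is absorbing
])

def sample_episode(start=0, max_steps=20, rng=np.random.default_rng(0)):
    """Follow the chain from `start` until Sleep (state 3) or max_steps."""
    s, episode = start, [states[start]]
    for _ in range(max_steps):
        if s == 3:                        # absorbed in Sleep
            break
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())                   # e.g. ['C', 'Fb', 'Fb', ..., 'S']
```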

Markov Reward Process: the student REWARD chain. Same states, now with rewards: Facebook R=-1, Class R=2, Party R=1, Sleep R=0 (transition diagram on slide).

Example: the student REWARD chain. Markov Reward Process, definition: a tuple (S, P, R, γ) where S = states, including a start state; P = transition matrix, P_{ss'} = Pr[S_{t+1} = s' | S_t = s]; R = reward function, R_s = E[R_{t+1} | S_t = s]; γ ∈ [0,1] = discount factor. Return (exponentially diminishing rewards): G_t = Σ_{k=1}^∞ γ^{k-1} R_{t+k}. Why discount? What happens when γ = 0? When γ = 1? With a discount factor < 1, G_t is always well defined, regardless of stationarity.

Example: the student REWARD chain. Example episode (random walk) with discount factor γ = 1/2: C C Fb Fb S. Total reward: G_1 = 2 + 2·(1/2) + (-1)·(1/2)^2 + (-1)·(1/2)^3 + 0 = 2 + 1 - 1/4 - 1/8 = 2.625. Another episode: C Fb Fb Fb Fb C P S; G_1 = ?
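A quick sanity check of the arithmetic above (my own snippet, not from the slides), using the per-state rewards 2, 2, -1, -1, 0 of the episode C C Fb Fb S:

```python
# Discounted sum of the episode's rewards (2, 2, -1, -1, 0) with gamma = 1/2,
# matching the slide's computation of G_1.
rewards = [2, 2, -1, -1, 0]
gamma = 0.5

G1 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G1)   # 2 + 1 - 1/4 - 1/8 + 0 = 2.625
```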

The value function: a mapping from states to real numbers, v(s) = E[G_t | S_t = s].

Value function for γ = 0: each state's value equals its immediate reward: Facebook V=-1, Sleep V=0, Class V=2, Party V=1. (Recall v(s) = E[G_t | S_t = s] and G_t = Σ_{k=1}^∞ γ^{k-1} R_{t+k}.)

Computing the value function: how can we compute v(s) = E[G_t | S_t = s]?

The Bellman equation for MRPs: v(s) = R_s + γ Σ_{s'} P_{ss'} v(s'), where R_s = E[R_{t+1} | S_t = s] and P_{ss'} is the transition probability from s to s'.

The Bellman equation for MRPs, derivation:
v(s) = E[G_t | S_t = s]
= E[Σ_{k=1}^∞ γ^{k-1} R_{t+k} | S_t = s]
= E[R_{t+1} + γ Σ_{k=1}^∞ γ^{k-1} R_{t+1+k} | S_t = s]
= E[R_{t+1} + γ G_{t+1} | S_t = s]
= E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
= R_s + γ Σ_{s'} P_{ss'} v(s'),
where R_s = E[R_{t+1} | S_t = s] and P_{ss'} is the transition probability from s to s'.

Bellman equation in matrix form. How can we compute the value function? The equation v(s) = R_s + γ Σ_{s'} P_{ss'} v(s') can be written as v = R + γPv, where v is the vector of values v(s), R is the vector of rewards R_s for s ∈ S, and P is the transition matrix. Thus v = (I - γP)^{-1} R: a system of linear equations (Gaussian elimination, cubic time).
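A minimal sketch (my own, not the lecture's code) of solving this linear system with numpy; the transition matrix and rewards below are illustrative placeholders for a four-state chain rather than the exact numbers from the slide's diagram.

```python
import numpy as np

# Placeholder student-style MRP: states Class, Facebook, Party, Sleep.
P = np.array([
    [0.2, 0.4, 0.2, 0.2],    # from Class
    [0.1, 0.9, 0.0, 0.0],    # from Facebook
    [0.4, 0.0, 0.4, 0.2],    # from Party
    [0.0, 0.0, 0.0, 1.0],    # Sleep is absorbing
])
R = np.array([2.0, -1.0, 1.0, 0.0])   # per-state expected rewards
gamma = 0.5

# Solve (I - gamma*P) v = R; this is the Gaussian-elimination route from the slide.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(np.round(v, 3))
```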

Markov Decision Process, definition: a tuple (S, A, P, R, γ) where S = states, including a start state; A = set of possible actions; P = transition matrix, P^a_{ss'} = Pr[S_{t+1} = s' | S_t = s, A_t = a]; R = reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]; γ ∈ [0,1] = discount factor. Return: G_t = Σ_{k=1}^∞ γ^{k-1} R_{t+k}. Goal: take actions to maximize the expected return.

Policies. The Markovian structure implies that the best action depends only on the current state! A policy is a mapping from states to distributions over actions: π: S → Δ(A), with π(a|s) = Pr[A_t = a | S_t = s]. Given a policy, the MDP reduces to a Markov Reward Process.

Reminder: MDP 1 (robot in a room). Actions: UP, DOWN, LEFT, RIGHT; taking UP: 80% move UP, 10% move LEFT, 10% move RIGHT. Reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step. States, actions, rewards. Policies?

Reminder 2 State? Actions? Rewards? Policy?

Policies: action = study (the resulting reward chain over Facebook R=-1, Class R=2, Party R=1, Sleep R=0 is shown in the diagram on the slide).

Policies: action = facebook (the resulting reward chain is shown in the diagram on the slide).

Fixed policy → Markov Reward Process. Given a policy π, the MDP reduces to a Markov Reward Process with P^π_{ss'} = Σ_a π(a|s) P^a_{ss'} and R^π_s = Σ_a π(a|s) R^a_s. Value function for a policy: v_π(s) = E_π[G_t | S_t = s]. Action-value function for a policy: q_π(s, a) = E_π[G_t | S_t = s, A_t = a]. How do we compute the best policy?
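The reduction above is easy to write down in code. The following is a sketch under assumed, made-up numbers (a toy 2-state, 2-action MDP of my own; none of it comes from the lecture): it collapses P^a and R^a under a fixed stochastic policy into P^π and R^π, then evaluates v_π by solving the resulting MRP's Bellman system, and also computes q_π.

```python
import numpy as np

n_states, gamma = 2, 0.9
P = np.array([                       # P[a, s, s'] = Pr[s' | s, a]  (made-up numbers)
    [[0.8, 0.2], [0.3, 0.7]],        # action 0
    [[0.5, 0.5], [0.1, 0.9]],        # action 1
])
R = np.array([                       # R[a, s] = expected reward for taking a in s
    [1.0, 0.0],
    [2.0, -1.0],
])
pi = np.array([                      # pi[s, a] = Pr[a | s], an arbitrary fixed policy
    [0.6, 0.4],
    [0.0, 1.0],
])

# Collapse the MDP under pi: P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}, R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)

# Evaluate the induced MRP: v_pi = (I - gamma * P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Action-value function: q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum("ast,t->as", P, v_pi)
print(np.round(v_pi, 3), np.round(q_pi, 3))
```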

The Bellman equation. Policies satisfy the Bellman equation: v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s] = R^π_s + γ Σ_{s'} P^π_{ss'} v_π(s'). The value and action-value functions are related by v_π(s) = Σ_a π(a|s) q_π(s, a). Optimal value function and action-value function: v*(s) = max_π v_π(s), q*(s, a) = max_π q_π(s, a). Important: v*(s) = max_a q*(s, a). Why?

Theorem. There exists an optimal policy π* (and it is deterministic!). All optimal policies achieve the same optimal value v*(s) at every state, and the same optimal action-value function q*(s, a) at every state and for every action. How can we find it? The Bellman equation v*(s) = max_a q*(s, a) implies the Bellman optimality equations: q*(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q*(s', a') and v*(s) = max_a [R^a_s + γ Σ_{s'} P^a_{ss'} v*(s')].
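The slides defer algorithms to the next class, but as a preview, here is a minimal sketch (my own, using the same made-up toy MDP as above) of iterating the Bellman optimality equation for q* as a fixed point, i.e. value iteration, and reading off a deterministic optimal policy.

```python
import numpy as np

gamma = 0.9
P = np.array([                       # P[a, s, s'] (same made-up toy MDP as above)
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.5, 0.5], [0.1, 0.9]],
])
R = np.array([[1.0, 0.0], [2.0, -1.0]])   # R[a, s]

q = np.zeros_like(R)
for _ in range(1000):
    # Bellman optimality backup: q(s,a) <- R^a_s + gamma * sum_{s'} P^a_{ss'} max_{a'} q(s',a')
    v = q.max(axis=0)
    q_new = R + gamma * np.einsum("ast,t->as", P, v)
    if np.max(np.abs(q_new - q)) < 1e-10:   # stop once the backup is (numerically) a fixed point
        q = q_new
        break
    q = q_new

v_star = q.max(axis=0)               # v*(s) = max_a q*(s, a)
pi_star = q.argmax(axis=0)           # a deterministic optimal policy
print(np.round(v_star, 3), pi_star)
```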

Summary. Markov Reward Processes: a generalization of Markov chains. Markov Decision Processes: a formalization of learning with state from environment observations in a Markovian world. The Bellman equation: the fundamental recursive property of MDPs; it will enable algorithms (next class).