An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning
Shivaram Kalyanakrishnan (shivaram@cse.iitb.ac.in)
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
April 2018

What is Reinforcement Learning?

[Video 1 of toddler learning to walk]

Learning to Drive a Bicycle using Reinforcement Learning and Shaping. Jette Randløv and Preben Alstrøm. ICML 1998.

Learning by trial and error to perform sequential decision making.

1. https://www.youtube.com/watch?v=jizuy9fcf1k

Our View of Reinforcement Learning

[Diagram: Reinforcement Learning at the confluence of Operations Research (Dynamic Programming), Control Theory, Psychology (Animal Behaviour), Neuroscience, and Artificial Intelligence and Computer Science, illustrated with representative figures R. E. Bellman, D. P. Bertsekas, B. F. Skinner, W. Schultz, and R. S. Sutton.]

Resources

Reinforcement Learning: A Survey. Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. JAIR 1996.
Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto. MIT Press, 1998 (2017-18 draft also now on-line).
Algorithms for Reinforcement Learning. Csaba Szepesvári. Morgan & Claypool, 2010.

Today's Class
1. Markov decision problems
2. Planning and learning
3. Deep reinforcement learning
4. Summary

Markov Decision Problem

[Diagram: at each time step, the learning agent observes state s_t and reward r_t, takes action a_t, and the environment (T, R) returns the next state s_{t+1} and reward r_{t+1}.]

- S: set of states.
- A: set of actions.
- T: transition function. For all s ∈ S, a ∈ A, T(s, a) is a distribution over S.
- R: reward function. For all s, s' ∈ S, a ∈ A, R(s, a, s') is a finite real number.
- γ: discount factor, 0 ≤ γ < 1.
- Policy π : S → A.

Trajectory over time: s_0, a_0, r_0, s_1, a_1, r_1, ..., s_t, a_t, r_t, s_{t+1}, ....

Value, or expected long-term reward, of state s under policy π:
V^π(s) = E[r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s, a_i = π(s_i)].

Objective: Find π such that V^π(s) is maximal for all s ∈ S.
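To make these ingredients concrete, here is a minimal Python sketch (not from the slides) that encodes a toy two-state, two-action MDP as plain data structures and samples a trajectory s_0, a_0, r_0, s_1, ... under a fixed policy. The particular numbers, the dictionary layout, and the function names are illustrative assumptions.

```python
# A minimal sketch of the MDP ingredients above as plain Python data, with a
# function that samples a trajectory under a fixed policy pi. The toy
# two-state, two-action MDP here is illustrative, not taken from the slides.
import random

S = [0, 1]                      # states
A = ["red", "green"]            # actions
gamma = 0.9                     # discount factor

# T[s][a] is a distribution over next states; R[(s, a, s')] is a real number.
T = {0: {"red": {0: 0.8, 1: 0.2}, "green": {0: 0.1, 1: 0.9}},
     1: {"red": {0: 0.5, 1: 0.5}, "green": {0: 0.0, 1: 1.0}}}
R = {(s, a, s2): (10.0 if s2 == 1 else -1.0) for s in S for a in A for s2 in S}

pi = {0: "green", 1: "red"}     # a fixed deterministic policy pi : S -> A

def sample_trajectory(s0, horizon=5):
    """Roll out s_0, a_0, r_0, s_1, ... by sampling from T and R."""
    s, trajectory = s0, []
    for _ in range(horizon):
        a = pi[s]
        dist = T[s][a]
        s_next = random.choices(list(dist), weights=list(dist.values()))[0]
        trajectory.append((s, a, R[(s, a, s_next)]))
        s = s_next
    return trajectory

print(sample_trajectory(0))
```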

Examples

What are the agent and environment? What are S, A, T, and R?

[Image 1: chess position] [Image 2: helicopter] [Video 3 of Tetris]

An Application of Reinforcement Learning to Aerobatic Helicopter Flight. Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y. Ng. NIPS 2006.

1. http://www.chess-game-strategies.com/images/kqa_chessboard_large-picture_2d.gif
2. http://www.aviationspectator.com/files/images/SH-3-Sea-King-helicopter-191.preview.jpg
3. https://www.youtube.com/watch?v=khhzyghxsee

Today's Class
1. Markov decision problems
2. Planning and learning
3. Deep reinforcement learning
4. Summary

Bellman's Equations

Recall that V^π(s) = E[r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s, a_i = π(s_i)].

Bellman's equations (for all s ∈ S):
V^π(s) = Σ_{s' ∈ S} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')].
V^π is called the value function of π.

Define (for all s ∈ S, a ∈ A):
Q^π(s, a) = Σ_{s' ∈ S} T(s, a, s') [R(s, a, s') + γ V^π(s')].
Q^π is called the action value function of π. Note that V^π(s) = Q^π(s, π(s)).

The variables in Bellman's equations are the V^π(s): |S| linear equations in |S| unknowns.

Thus, given S, A, T, R, γ, and a fixed policy π, we can solve Bellman's equations efficiently to obtain V^π(s) and Q^π(s, a) for every s ∈ S and a ∈ A.
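Because Bellman's equations for a fixed π are |S| linear equations in |S| unknowns, policy evaluation reduces to a single linear solve. Below is a minimal numpy sketch under the assumption that T and R are given as arrays of shape (|S|, |A|, |S|) and π as an array of action indices; this array layout and the function name are illustrative, not the slides' notation.

```python
# Policy evaluation by solving Bellman's equations directly:
# V_pi = b + gamma * P_pi V_pi, i.e. (I - gamma * P_pi) V_pi = b.
# Assumes T and R have shape (|S|, |A|, |S|) and pi is an array of action indices.
import numpy as np

def evaluate_policy(T, R, pi, gamma):
    n_states = T.shape[0]
    P_pi = T[np.arange(n_states), pi]                       # (|S|, |S|): T(s, pi(s), s')
    b = np.sum(P_pi * R[np.arange(n_states), pi], axis=1)   # expected one-step reward under pi
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, b) # V_pi
    Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)  # Q_pi(s, a) for all s, a
    return V, Q
```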

Bellman's Optimality Equations

Let Π be the set of all policies. What is its cardinality? (For deterministic policies, |A|^|S|.)

It can be shown that there exists a policy π* ∈ Π such that for all π ∈ Π and all s ∈ S, V^{π*}(s) ≥ V^π(s). V^{π*} is denoted V*, and Q^{π*} is denoted Q*. There could be multiple optimal policies π*, but V* and Q* are unique.

Bellman's optimality equations (for all s ∈ S):
V*(s) = max_{a ∈ A} Σ_{s' ∈ S} T(s, a, s') [R(s, a, s') + γ V*(s')].

V* can be computed up to arbitrary precision using Value Iteration.
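Here is a minimal sketch of value iteration under the same assumed array layout as the policy-evaluation sketch above; the stopping tolerance and the greedy-policy extraction at the end are standard choices that the slide does not spell out.

```python
# Value iteration: repeatedly apply the Bellman optimality operator until the
# change in V falls below a tolerance, then read off a greedy policy.
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        Q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)  # (|S|, |A|)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)                      # V*, a greedy (optimal) policy
        V = V_new
```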

Planning and Learning

Planning problem: Given S, A, T, R, γ, how can we find an optimal policy π*? We need to be computationally efficient.

Learning problem: Given S, A, γ, and the facility to follow a trajectory by sampling from T and R, how can we find an optimal policy π*? We need to be sample-efficient.

The Learning Problem

[Figure: an MDP over ten states drawn as a graph; successive slides reveal transition probabilities (e.g., 0.35, 0.5, 0.15) and rewards on the edges, and then a sample trajectory whose first rewards are r_0 = 1, r_1 = 2, r_2 = 10.]

How to take actions so as to maximise expected long-term reward E[r_0 + γ r_1 + γ^2 r_2 + ...]?

[Note that there exists an (unknown) optimal policy.]
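For concreteness, with an illustrative discount factor of γ = 0.9 (the slide does not fix γ), the three observed rewards alone contribute
r_0 + γ r_1 + γ^2 r_2 = 1 + 0.9 × 2 + 0.81 × 10 = 10.9
to the return from the starting state; rewards further in the future are weighted by γ^3, γ^4, and so on.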

Q-Learning

Keep a running estimate of the expected long-term reward obtained by taking each action from each state s, and acting optimally thereafter.

State    Q(s, red)    Q(s, green)
1          -0.2          10
2           4.5          13
3           6            -8
4           0             0.2
5          -4.2          -4.2
6           1.2           1.6
7          10             6
8           4.8           9.9
9           5.0          -3.4
10         -1.9           2.3

Update these estimates based on experience (s_t, a_t, r_t, s_{t+1}):
Q(s_t, a_t) ← Q(s_t, a_t) + α_t {r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)}.

Act greedily based on the estimates (exploit) most of the time, but still make sure to explore each action enough times.

Q-learning will converge and induce an optimal policy!

Q-Learning Algorithm

Let Q be our guess of Q*: for every state s and action a, initialise Q(s, a) arbitrarily. We will start in some state s_0.

For t = 0, 1, 2, ...:
1. Take an action a_t, chosen uniformly at random with probability ε, and to be argmax_a Q(s_t, a) with probability 1 − ε.
2. The environment will generate the next state s_{t+1} and reward r_{t+1}.
3. Update: Q(s_t, a_t) ← Q(s_t, a_t) + α_t (r_{t+1} + γ max_{a ∈ A} Q(s_{t+1}, a) − Q(s_t, a_t)).

[ε: parameter for ε-greedy exploration]
[α_t: learning rate]
[r_{t+1} + γ max_{a ∈ A} Q(s_{t+1}, a) − Q(s_t, a_t): temporal difference prediction error]

For ε ∈ (0, 1] and α_t = 1/t, it can be proven that as t → ∞, Q → Q*.

Q-Learning. Christopher J. C. H. Watkins and Peter Dayan. Machine Learning, 1992.
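The loop above translates almost directly into code. Here is a minimal tabular sketch, assuming a toy environment object with reset() returning a state index and step(a) returning (next_state, reward); that interface, and the constant learning rate used in place of α_t = 1/t, are illustrative assumptions.

```python
# A minimal sketch of the Q-learning algorithm above, for discrete states
# 0..n_states-1 and actions 0..n_actions-1. The env interface is assumed.
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.9, epsilon=0.1,
               alpha=0.1, n_steps=100_000):
    Q = np.zeros((n_states, n_actions))        # arbitrary initialisation of Q
    s = env.reset()
    for t in range(n_steps):
        # epsilon-greedy: explore with probability epsilon, else act greedily.
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # temporal-difference update towards r + gamma * max_a' Q(s', a').
        td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * td_error
        s = s_next
    return Q                                    # greedy policy: argmax_a Q[s, a]
```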

Practice In Spite of the Theory!

Task (reference)                           | State aliasing | State space | Policy representation (number of features)
Backgammon (T1992)                         | Absent  | Discrete   | Neural network (198)
Job-shop scheduling (ZD1995)               | Absent  | Discrete   | Neural network (20)
Tetris (BT1996)                            | Absent  | Discrete   | Linear (22)
Elevator dispatching (CB1996)              | Present | Continuous | Neural network (46)
Acrobot control (S1996)                    | Absent  | Continuous | Tile coding (4)
Dynamic channel allocation (SB1997)        | Absent  | Discrete   | Linear (100s)
Active guidance of finless rocket (GM2003) | Present | Continuous | Neural network (14)
Fast quadrupedal locomotion (KS2004)       | Present | Continuous | Parameterized policy (12)
Robot sensing strategy (KF2004)            | Present | Continuous | Linear (36)
Helicopter control (NKJS2004)              | Present | Continuous | Neural network (10)
Dynamic bipedal locomotion (TZS2004)       | Present | Continuous | Feedback control policy (2)
Adaptive job routing/scheduling (WS2004)   | Present | Discrete   | Tabular (4)
Robot soccer keepaway (SSK2005)            | Present | Continuous | Tile coding (13)
Robot obstacle negotiation (LSYSN2006)     | Present | Continuous | Linear (10)
Optimized trade execution (NFK2007)        | Present | Discrete   | Tabular (2-5)
Blimp control (RPHB2007)                   | Present | Continuous | Gaussian process (2)
9x9 Go (SSM2007)                           | Absent  | Discrete   | Linear (~1.5 million)
Ms. Pac-Man (SL2007)                       | Absent  | Discrete   | Rule list (10)
Autonomic resource allocation (TJDB2007)   | Present | Continuous | Neural network (2)
General game playing (FB2008)              | Absent  | Discrete   | Tabular (part of state space)
Soccer opponent hassling (GRT2009)         | Present | Continuous | Neural network (9)
Adaptive epilepsy treatment (GVAP2008)     | Present | Continuous | Extremely rand. trees (114)
Computer memory scheduling (IMMC2008)      | Absent  | Discrete   | Tile coding (6)
Motor skills (PS2008)                      | Present | Continuous | Motor primitive coeff. (100s)
Combustion control (HNGK2009)              | Present | Continuous | Parameterized policy (2-3)

Perfect representations (fully observable, enumerable states) are impractical.

Today's Class
1. Markov decision problems
2. Planning and learning
3. Deep reinforcement learning
4. Summary

Typical Neural Network-based Representation of Q

[Figure 1: a typical neural-network representation of Q (image linked below).]
1. http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg

ATARI 2600 Games (MKSRVBGRFOPBSAKKWLH2015)

[Breakout video 1]
1. http://www.nature.com/nature/journal/v518/n7540/extref/nature14236-sv2.mov

AlphaGo (SHMGSDSAPLDGNKSLLKGH2016)

March 2016: DeepMind's program beats Go champion Lee Sedol 4-1.

1. http://www.kurzweilai.net/images/alphago-vs.-sedol.jpg
2. http://static1.uk.businessinsider.com/image/56e0373052bcd05b008b5217-810-602/screen%20shot%202016-03-09%20at%2014.png

Learning Algorithm: Batch Q-learning

1. Represent the action value function Q as a neural network.
   AlphaGo: Use both a policy network and an action value network.
2. Gather data (on the simulator) by taking ε-greedy actions w.r.t. Q:
   (s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, ..., s_D, a_D, r_D, s_{D+1}).
   AlphaGo: Use Monte Carlo Tree Search for action selection.
3. Train the network such that Q(s_t, a_t) ≈ r_t + γ max_a Q(s_{t+1}, a). Go to 2.
   AlphaGo: Trained using self-play.
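Here is a minimal sketch of this batch loop, with a linear approximator over one-hot (state, action) features standing in for the neural network so that the example stays self-contained; the environment interface (reset()/step(a)) and all names are illustrative assumptions.

```python
# A minimal sketch of the batch Q-learning loop above, with a linear Q over
# one-hot (state, action) features in place of a neural network. The toy
# env interface (reset() -> s, step(a) -> (s', r)) is an assumption.
import numpy as np

def featurise(s, a, n_states, n_actions):
    """One-hot feature vector for a (state, action) pair."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def batch_q_learning(env, n_states, n_actions, gamma=0.9, epsilon=0.1,
                     batch_size=500, n_batches=50):
    w = np.zeros(n_states * n_actions)          # weights of the linear Q
    q = lambda s, a: featurise(s, a, n_states, n_actions) @ w

    s = env.reset()
    for _ in range(n_batches):
        # 2. Gather a batch of experience with epsilon-greedy actions w.r.t. Q.
        batch = []
        for _ in range(batch_size):
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax([q(s, b) for b in range(n_actions)]))
            s_next, r = env.step(a)
            batch.append((s, a, r, s_next))
            s = s_next
        # 3. Regress Q(s, a) towards r + gamma * max_a' Q(s', a') (one sweep).
        X = np.array([featurise(s_, a_, n_states, n_actions)
                      for s_, a_, _, _ in batch])
        y = np.array([r_ + gamma * max(q(s2, b) for b in range(n_actions))
                      for _, _, r_, s2 in batch])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```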

Today's Class
1. Markov decision problems
2. Planning and learning
3. Deep reinforcement learning
4. Summary

Summary

- Learning by trial and error to perform sequential decision making. Do not program behaviour! Rather, specify goals.
- Rich history, at the confluence of several fields of study; firm foundations.
- Given an MDP (S, A, T, R, γ), we have to find a policy π : S → A that yields high expected long-term reward from states.
- An optimal value function V* exists, and it induces an optimal policy π* (several optimal policies might exist).
- Under planning, we are given S, A, T, R, and γ. We may compute V* and π* using a dynamic programming algorithm such as policy iteration.
- In the learning context, we are given S, A, and γ; we may sample T and R in a sequential manner. We can still converge to V* and π* by applying a temporal difference learning method such as Q-learning.
- Limited in practice by the quality of the representation used. Deep neural networks address the representation problem in some domains, and have yielded impressive results.