Introduction to Reinforcement Learning


Introduction to Reinforcement Learning

Deep Reinforcement Learning

Reference
Textbook: Reinforcement Learning: An Introduction, http://incompleteideas.net/sutton/book/the-book.html
Lectures of David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html (10 lectures, around 1:30 each)
Deep Reinforcement Learning: http://videolectures.net/rldm2015_silver_reinforcement_learning/
Lectures of John Schulman: https://youtu.be/aurx-rp_ss4

Scenario of Reinforcement Learning: the agent receives an observation (the state) from the environment, takes an action that changes the environment, and the environment returns a reward (e.g., "Don't do that").

Scenario of Reinforcement Learning: the agent learns to take actions that maximize the expected reward. Given the next observation it acts again, changes the environment, and receives another reward (e.g., "Thank you").

Machine Learning ≈ Looking for a Function: the observation is the function input, the action is the function output, and the actor/policy is the function itself: Action = π(Observation). The reward is used to pick the best function.

Learning to play Go: the observation is the board position, the action is the next move, and the reward comes from the environment.

Learning to play Go: the agent learns to take actions maximizing the expected reward. The reward is 0 in most cases; if the agent wins, the reward is 1, and if it loses, the reward is -1.

Learning to play Go. Supervised: learning from a teacher ("the next move is 5-5", "the next move is 3-3"). Reinforcement learning: learning from experience, from the first move through many moves until "Win!" (two agents play against each other). AlphaGo is supervised learning + reinforcement learning.

Learning a chat-bot: the machine obtains feedback from the user. For example, answering "Bye bye" to "How are you?" receives reward -10, while answering "Hi" to "Hello" receives reward 3. The chat-bot learns to maximize the expected reward.

Learning a chat-bot: let two agents talk to each other (sometimes they generate good dialogues, sometimes bad ones). For example: "How old are you?" / "See you." / "See you." / "See you." versus "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"

Learning a chat-bot: with this approach we can generate a lot of dialogues (Dialogue 1, Dialogue 2, ..., Dialogue 8). Use some pre-defined rules to evaluate the goodness of each dialogue, and the machine learns from that evaluation. Deep Reinforcement Learning for Dialogue Generation, https://arxiv.org/pdf/1606.01541v3.pdf

Learning a chat-bot. Supervised: the agent is told exactly what to say ("say Hi" when hearing "Hello", "say Good bye" when hearing "Bye bye"). Reinforcement: the agent carries on a conversation starting from "Hello" and only receives an overall judgement such as "Bad" at the end, from which it learns.

More applications:
Flying helicopter: https://www.youtube.com/watch?v=0jl04jjjocc
Driving: https://www.youtube.com/watch?v=0xo1ldx3l5q
Robot: https://www.youtube.com/watch?v=370ct-oazzm
Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI: http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
Text generation: https://www.youtube.com/watch?v=pbq4qe8ewlo

Example: Playing Video Games. Widely studied, e.g., Gym: https://gym.openai.com/ and Universe: https://openai.com/blog/universe/. The machine learns to play video games as human players do: what the machine observes is the screen pixels, and it learns to take the proper actions itself.

Example: Playing Video Games (Space Invaders). The score is the reward. Termination: all the aliens are killed, or your spaceship is destroyed. The goal is to kill the aliens; the screen also shows the shields and the fire action.

Example: Playing Video Games (Space Invaders). Play yourself: http://www.2600online.com/spaceinvaders.html How about the machine: https://gym.openai.com/evaluations/eval_eduozx4hryqgtcvk9ltw

Example: Playing Video Games. Start with observation s_1; take action a_1: "right" and obtain reward r_1 = 0; see observation s_2; take action a_2: "fire" (kill an alien) and obtain reward r_2 = 5; see observation s_3; and so on. Usually there is some randomness in the environment.

Example: Playing Video Games. Start with observation s_1, then s_2, s_3, ...; after many turns, take action a_T, obtain reward r_T, and the game is over (the spaceship is destroyed). This whole sequence is an episode. The goal is to learn to maximize the expected cumulative reward per episode.

Properties of Reinforcement Learning. Reward delay: in Space Invaders, only "fire" obtains a reward, although the moves before firing are important; in Go, it may be better to sacrifice immediate reward to gain more long-term reward. The agent's actions affect the subsequent data it receives, e.g., exploration.

Outline
Model-free approach: Policy-based (learning an actor); Value-based (learning a critic); Actor + Critic
Model-based approach
(AlphaGo: policy-based + value-based + model-based)

Policy-based Approach Learning an Actor

Three Steps for Deep Learning. Step 1: define a set of functions (a neural network as the actor). Step 2: define the goodness of a function. Step 3: pick the best function. Deep learning is so simple.

Neural network as actor. Input of the neural network: the machine's observation, represented as a vector or a matrix (e.g., pixels). Output of the neural network: each action corresponds to a neuron in the output layer, giving the probability of taking that action, e.g., left 0.7, right 0.2, fire 0.1. What is the benefit of using a network instead of a lookup table? Generalization.
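
To make the picture concrete, here is a minimal numpy sketch of such an actor, assuming a one-hidden-layer network over a flattened 256-dimensional observation and the three Space Invaders actions; the sizes, initialization, and sampling step are illustrative assumptions, not the lecture's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: a flattened observation of 256 pixels, one hidden layer,
# and three actions (left, right, fire).
OBS_DIM, HIDDEN, N_ACTIONS = 256, 64, 3
W1 = rng.normal(0.0, 0.1, (HIDDEN, OBS_DIM))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_ACTIONS, HIDDEN))
b2 = np.zeros(N_ACTIONS)

def actor(observation):
    """pi_theta(a | s): return one probability per action."""
    h = np.tanh(W1 @ observation + b1)
    logits = W2 @ h + b2
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

probs = actor(rng.normal(size=OBS_DIM))      # e.g., something like [0.7, 0.2, 0.1]
action = np.random.choice(["left", "right", "fire"], p=probs)
```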

Three Steps for Deep Learning. Step 1: define a set of functions (a neural network as the actor). Step 2: define the goodness of a function. Step 3: pick the best function. Deep learning is so simple.

Goodness of Actor. Review of supervised learning: given a set of parameters θ, each training example (e.g., an image x_1, x_2, ..., x_256 passed through the network and a softmax to outputs y_1, ..., y_10) incurs a loss l_n measuring how close the output is to the target (e.g., 1, 0, ..., 0); the total loss is L = Σ_{n=1}^N l_n.

Goodness of Actor. Given an actor π_θ(s) with network parameters θ, use the actor π_θ(s) to play the video game: start with observation s_1, the machine decides to take a_1 and obtains reward r_1, sees observation s_2, takes a_2 and obtains r_2, sees s_3, ..., takes a_T and obtains r_T, END. Total reward: R_θ = Σ_{t=1}^T r_t. Even with the same actor, R_θ is different each time because of randomness in the actor and the game. We therefore define R̄_θ as the expected value of R_θ; R̄_θ evaluates the goodness of the actor π_θ(s).

Goodness of Actor. We define R̄_θ as the expected value of R_θ. A trajectory is τ = {s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T}, and
P(τ|θ) = p(s_1) p(a_1|s_1, θ) p(r_1, s_2|s_1, a_1) p(a_2|s_2, θ) p(r_2, s_3|s_2, a_2) ... = p(s_1) Π_{t=1}^T p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t).
p(s_1) and p(r_t, s_{t+1}|s_t, a_t) are not related to your actor; p(a_t|s_t, θ) is controlled by your actor π_θ, e.g., p(a_t = "fire"|s_t, θ) = 0.7 when the actor outputs left 0.1, right 0.2, fire 0.7.

Goodness of Actor. An episode is considered as a trajectory τ = {s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T} with R(τ) = Σ_{t=1}^T r_t. If you use an actor to play the game, each τ has a probability of being sampled; the probability depends on the actor parameters θ: P(τ|θ). Then
R̄_θ = Σ_τ R(τ) P(τ|θ)  (sum over all possible trajectories)
≈ (1/N) Σ_{n=1}^N R(τ^n),
where we use π_θ to play the game N times, obtaining τ^1, τ^2, ..., τ^N; this is sampling τ from P(τ|θ) N times.
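
A minimal sketch of this sampling-based evaluation; play_one_episode is a hypothetical helper that runs the current actor π_θ for one episode and returns the list of rewards r_1, ..., r_T.

```python
import numpy as np

def estimate_expected_reward(play_one_episode, N=100):
    """Approximate R̄_θ = (1/N) Σ_n R(τ^n) by playing N episodes."""
    returns = []
    for _ in range(N):
        rewards = play_one_episode()         # one trajectory τ^n played with π_θ
        returns.append(sum(rewards))         # R(τ^n) = Σ_t r_t
    return np.mean(returns)
```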

Three Steps for Deep Learning. Step 1: define a set of functions (a neural network as the actor). Step 2: define the goodness of a function. Step 3: pick the best function. Deep learning is so simple.

Gradient Ascent. Problem statement: θ* = arg max_θ R̄_θ. Gradient ascent: start with θ^0, then θ^1 ← θ^0 + η∇R̄_{θ^0}, θ^2 ← θ^1 + η∇R̄_{θ^1}, and so on, where θ = {w_1, w_2, ..., b_1, ...} and ∇R̄_θ = [∂R̄_θ/∂w_1, ∂R̄_θ/∂w_2, ..., ∂R̄_θ/∂b_1, ...].

Policy Gradient. R̄_θ = Σ_τ R(τ) P(τ|θ), so ∇R̄_θ = Σ_τ R(τ) ∇P(τ|θ) = Σ_τ R(τ) P(τ|θ) ∇P(τ|θ)/P(τ|θ) = Σ_τ R(τ) P(τ|θ) ∇log P(τ|θ) ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ), using d log f(x)/dx = (1/f(x)) df(x)/dx and using π_θ to play the game N times to obtain τ^1, ..., τ^N. Note that R(τ) does not have to be differentiable; it can even be a black box.

Policy Gradient. ∇log P(τ|θ) = ? Since τ = {s_1, a_1, r_1, ..., s_T, a_T, r_T} and P(τ|θ) = p(s_1) Π_{t=1}^T p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t), we have log P(τ|θ) = log p(s_1) + Σ_{t=1}^T [log p(a_t|s_t, θ) + log p(r_t, s_{t+1}|s_t, a_t)]. Ignoring the terms not related to θ: ∇log P(τ|θ) = Σ_{t=1}^T ∇log p(a_t|s_t, θ).

Policy Gradient. θ_new ← θ_old + η∇R̄_{θ_old}, where
∇R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ) = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ).
If in τ^n the machine takes a_t^n when seeing s_t^n: when R(τ^n) is positive, tune θ to increase p(a_t^n|s_t^n); when R(τ^n) is negative, tune θ to decrease p(a_t^n|s_t^n). What if we replaced R(τ^n) with r_t^n? It is very important to weight each term by the cumulative reward R(τ^n) of the whole trajectory τ^n instead of the immediate reward r_t^n.
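
As an illustrative sketch of this update (not the lecture's code), assume a linear softmax policy with logits = W s over integer-indexed actions, so that ∇_W log p(a|s, W) = (onehot(a) − probs) s^T in closed form; a neural-network actor would obtain the same quantity by backpropagation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_gradient_update(W, trajectories, lr=0.01):
    """One gradient-ascent step on R̄_θ for a linear softmax policy.

    trajectories: list of (states, actions, rewards) tuples, one per episode;
    states are observation vectors, actions are integer indices.
    """
    grad = np.zeros_like(W)
    for states, actions, rewards in trajectories:
        R = sum(rewards)                               # R(τ^n): return of the whole trajectory
        for s, a in zip(states, actions):
            probs = softmax(W @ s)
            onehot = np.zeros_like(probs)
            onehot[a] = 1.0
            grad += R * np.outer(onehot - probs, s)    # R(τ^n) * ∇ log p(a|s, W)
    grad /= len(trajectories)                          # average over the N sampled trajectories
    return W + lr * grad                               # gradient *ascent*
```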

Policy Gradient in practice: alternate between data collection and model update. Data collection: given the actor parameters θ, collect trajectories, e.g., τ^1: (s_1^1, a_1^1) with R(τ^1), (s_2^1, a_2^1) with R(τ^1), ...; τ^2: (s_1^2, a_1^2) with R(τ^2), (s_2^2, a_2^2) with R(τ^2), .... Model update: θ ← θ + η∇R̄_θ with ∇R̄_θ = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ). Then collect new data with the updated actor.

Policy Gradient considered as a classification problem. Treat each (s, a) pair as a classification example: the network takes s and outputs ŷ = (y_left, y_right, y_fire); the target y is one-hot, e.g., (1, 0, 0) when a = "left". Minimizing the cross-entropy −Σ_{i=1}^3 y_i log ŷ_i is the same as maximizing log ŷ_left = log p("left"|s), so the update is θ ← θ + η∇log p("left"|s).

Policy Gradient. Each (s_t^n, a_t^n) pair becomes one classification example: feed s_1^1 into the NN with target a_1^1 = left, i.e., (1, 0, 0) over (left, right, fire); feed s_1^2 with target a_1^2 = fire, i.e., (0, 0, 1). The update remains θ ← θ + η∇R̄_θ with ∇R̄_θ = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ).

Policy Gradient. The difference from ordinary classification is that each training example (s_t^n, a_t^n) is weighted by R(τ^n), the return of the trajectory it came from, as if the example were duplicated R(τ^n) times in the training data.

Add a Baseline. It is possible that R(τ^n) is always positive. In the ideal case this would be fine, but we only sample some actions: every sampled action gets its probability pushed up, and after normalization the probability of the actions that were not sampled will decrease, even if they are good. So subtract a baseline b:
∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} (R(τ^n) − b) ∇log p(a_t^n|s_t^n, θ),
so that only actions whose trajectories beat the baseline have their probability increased. The update is still θ_new ← θ_old + η∇R̄_{θ_old}.
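
The same sketch with a baseline subtracted; taking b to be the average sampled return is one common choice and an assumption here, not something specified on the slide.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_gradient_update_with_baseline(W, trajectories, lr=0.01):
    """Same as the previous sketch, but weight each example by R(τ^n) - b."""
    returns = [sum(rewards) for _, _, rewards in trajectories]
    b = np.mean(returns)                               # baseline: average sampled return (assumed choice)
    grad = np.zeros_like(W)
    for (states, actions, _), R in zip(trajectories, returns):
        for s, a in zip(states, actions):
            probs = softmax(W @ s)
            onehot = np.zeros_like(probs)
            onehot[a] = 1.0
            grad += (R - b) * np.outer(onehot - probs, s)
    grad /= len(trajectories)
    return W + lr * grad
```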

Value-based Approach Learning a Critic

Critic. A critic does not determine the action. Given an actor π, it evaluates how good the actor is. An actor can be found from a critic, e.g., Q-learning.

Critic. State value function V^π(s): when using actor π, the cumulative reward expected to be obtained after seeing observation (state) s. V^π takes s as input and outputs a scalar V^π(s): for a good state V^π(s) is large, for a bad one it is smaller.

Critic. V^π(the big knight's-move play made by the earlier, weaker Hikaru) = bad; V^π(the same play made by the stronger Hikaru) = good. (An example from Hikaru no Go: the value of the same position depends on the actor π.)

How to estimate V^π(s): Monte-Carlo (MC) based approach. The critic watches π playing the game. After seeing s_a, the cumulative reward until the end of the episode is G_a, so train V^π such that V^π(s_a) ≈ G_a. After seeing s_b, the cumulative reward until the end of the episode is G_b, so train V^π such that V^π(s_b) ≈ G_b.

How to estimate V^π(s): Temporal-difference (TD) approach. From a single transition (s_t, a_t, r_t, s_{t+1}), train V^π so that V^π(s_t) = r_t + V^π(s_{t+1}), i.e., the difference V^π(s_t) − V^π(s_{t+1}) should match r_t. Some applications have very long episodes, so delaying all learning until an episode's end is too slow.
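
A minimal tabular TD(0) sketch of this idea; the dictionary representation of V^π and the learning rate α are assumptions, and the undiscounted target r_t + V^π(s_{t+1}) follows the slide (a discount factor could be inserted in front of V^π(s_{t+1})).

```python
def td0_update(V, s_t, r_t, s_next, alpha=0.1):
    """Move V(s_t) toward the TD target r_t + V(s_{t+1}).

    V is a dict mapping state -> estimated value (missing/terminal states count as 0);
    alpha is the learning rate.
    """
    v_t = V.get(s_t, 0.0)
    target = r_t + V.get(s_next, 0.0)        # TD target
    V[s_t] = v_t + alpha * (target - v_t)    # learn from a single transition
    return V
```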

MC vs. TD. MC: V^π(s_a) ← G_a; larger variance, but unbiased. TD: V^π(s_t) ← r_t + V^π(s_{t+1}); smaller variance, but V^π(s_{t+1}) may be biased.

MC vs. TD [Sutton, v2, Example 6.4]. The critic observes the following 8 episodes (the actions are ignored here):
s_a, r = 0, s_b, r = 0, END
s_b, r = 1, END (this episode occurs six times)
s_b, r = 0, END
Clearly V^π(s_b) = 3/4. What is V^π(s_a)? Monte-Carlo: V^π(s_a) = 0, because the only episode containing s_a has return 0. Temporal-difference: V^π(s_a) = r + V^π(s_b) = 0 + 3/4 = 3/4.
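
The numbers above can be checked with a few lines of Python; the episode encoding below is just an assumed representation for this example.

```python
# Each episode is a list of (state, reward received after leaving that state) pairs.
episodes = [
    [("s_a", 0), ("s_b", 0)],                 # s_a, r = 0, s_b, r = 0, END
    [("s_b", 1)], [("s_b", 1)], [("s_b", 1)],
    [("s_b", 1)], [("s_b", 1)], [("s_b", 1)], # six episodes: s_b, r = 1, END
    [("s_b", 0)],                             # s_b, r = 0, END
]

# Monte-Carlo: average the return observed from each visit to the end of its episode.
returns = {"s_a": [], "s_b": []}
for ep in episodes:
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[i:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)                                   # {'s_a': 0.0, 's_b': 0.75}

# TD-style answer for s_a: V(s_a) = r + V(s_b) = 0 + 3/4.
print(0 + V_mc["s_b"])                        # 0.75
```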

Another Critic. State-action value function Q^π(s, a): when using actor π, the cumulative reward expected to be obtained after seeing observation s and taking action a. One form takes both s and a as input and outputs the scalar Q^π(s, a). Another form takes only s as input and outputs Q^π(s, a = left), Q^π(s, a = right), Q^π(s, a = fire); this form is for discrete actions only.

Q-Learning (the loop): π interacts with the environment; learn Q^π(s, a) by TD or MC; find a new actor π' that is "better" than π; set π = π' and repeat.

Q-Learning. Given Q^π(s, a), find a new actor π' that is "better" than π. "Better" means V^{π'}(s) ≥ V^π(s) for all states s. The new actor is π'(s) = arg max_a Q^π(s, a). π' has no extra parameters; it depends only on Q. This is not suitable for continuous actions a.
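
A minimal sketch of this improvement step for a discrete action set, assuming Q^π is stored as a dictionary keyed by (state, action); both the action list and the representation are illustrative assumptions.

```python
ACTIONS = ["left", "right", "fire"]          # assumed discrete action set

def improved_actor(Q, state):
    """pi'(s) = argmax_a Q^pi(s, a); unseen (state, action) pairs default to 0."""
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
```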

Deep Reinforcement Learning Actor-Critic

Actor-Critic (the loop): π interacts with the environment; learn Q^π(s, a) and V^π(s) by TD or MC; update the actor from π to π' based on Q^π(s, a) and V^π(s); set π = π' and repeat.

Actor-Critic tips: the parameters of the actor π(s) and of the critic V^π(s) can be shared. The state s goes through a common network; one output branch gives the action probabilities (left, right, fire) and the other gives V^π(s).
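
A minimal numpy sketch of this parameter-sharing tip, with assumed layer sizes: the shared layers produce a common representation that feeds both a policy head and a value head.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HIDDEN, N_ACTIONS = 256, 64, 3               # assumed sizes

W_shared = rng.normal(0.0, 0.1, (HIDDEN, OBS_DIM))    # layers shared by actor and critic
W_policy = rng.normal(0.0, 0.1, (N_ACTIONS, HIDDEN))  # actor head
w_value = rng.normal(0.0, 0.1, HIDDEN)                # critic head

def actor_critic(observation):
    """Return (action probabilities, V(s)) computed from a shared representation."""
    h = np.tanh(W_shared @ observation)
    logits = W_policy @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    value = float(w_value @ h)
    return probs, value
```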

Asynchronous (A3C). Each worker repeatedly: 1. copies the global parameters θ; 2. samples some data; 3. computes the gradient ∆θ; 4. updates the global model, θ ← θ + η∆θ (other workers also update the model at the same time). Source of image: https://medium.com/emergentfuture/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9
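
A schematic sketch of one worker's loop following the four steps above; the callables passed in (copy_global, sample_data, compute_grads, apply_grads) are hypothetical placeholders standing for whatever copying, rollout, gradient, and update code an implementation would use, not a real API.

```python
def worker_loop(global_params, copy_global, sample_data, compute_grads, apply_grads, n_iter=1000):
    """One asynchronous worker; several such loops run in parallel."""
    for _ in range(n_iter):
        local = copy_global(global_params)   # 1. copy the global parameters θ
        data = sample_data(local)            # 2. sample some data with the local actor
        grads = compute_grads(local, data)   # 3. compute gradients (e.g., the policy gradient)
        apply_grads(global_params, grads)    # 4. update the global model: θ ← θ + η∆θ
```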

Demo of A3C https://www.youtube.com/watch?v=0xo1ldx3l5q Racing Car (DeepMind)

Demo of A3C Visual Doom AI Competition @ CIG 2016 https://www.youtube.com/watch?v=94epsjqh38y

Concluding Remarks
Model-free approach: Policy-based (learning an actor); Value-based (learning a critic); Actor + Critic
Model-based approach