Deep Reinforcement Learning: Scratching the Surface
Deep Reinforcement Learning
Scenario of Reinforcement Learning: the agent receives an observation of the environment (the observation is also called the state), takes an action that changes the environment, and receives a reward from the environment, e.g., "Don't do that" for a bad action or "Thank you." for a good one. The agent learns to take actions that maximize expected reward. (Image source: https://yoast.com/howto-clean-site-structure/)
Learning to play Go: the observation is the board position, the action is the next move, and the environment is the opponent. The agent learns to take actions to maximize expected reward: reward = 0 in most cases; if win, reward = 1; if loss, reward = -1.
Learning to play Go - Supervised vs. Reinforcement. Supervised: learning from a teacher ("Next move: 5-5", "Next move: 3-3"). Reinforcement learning: learning from experience; after the first move and many more moves, win! (Two agents play with each other.) AlphaGo is supervised learning + reinforcement learning.
Learning a chat-bot: sequence-to-sequence learning
Learning a chat-bot - Supervised vs. Reinforcement. Supervised: the teacher tells the machine what to say ("Hello" → say "Hi"; "Bye bye" → say "Good bye"). Reinforcement: the agent replies to "Hello" however it likes and only receives feedback ("Bad") on the exchange.
Learning a chat-bot - Reinforcement Learning. Let two agents talk to each other; sometimes they generate a good dialogue ("How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"), sometimes a bad one ("How old are you?" / "See you." / "See you." / "See you.").
Learning a chat-bot - Reinforcement Learning. By this approach we can generate many dialogues (Dialogue 1, Dialogue 2, ..., Dialogue 8, ...). Use some pre-defined rules to evaluate the goodness of each dialogue, and the machine learns from that evaluation. Deep Reinforcement Learning for Dialogue Generation, https://arxiv.org/pdf/1606.01541v3.pdf
More applications - Interactive retrieval [Wu & Lee, INTERSPEECH 16]: the user queries "US President"; the system asks back "More precisely, please." (user: "Trump") and "Is it related to the election?" (user: "Yes.") before answering "Here is what you are looking for." (user: "I see!").
More applications - Flying helicopter: https://www.youtube.com/watch?v=0jl04jjjocc. Driving: https://www.youtube.com/watch?v=0xo1ldx3l5q. Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI: http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai. Text generation: Hongyu Guo, Generating Text with Deep Reinforcement Learning, NIPS, 2015; Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, Sequence Level Training with Recurrent Neural Networks, ICLR, 2016.
Example: Playing Video Game. Widely studied: Gym: https://gym.openai.com/; Universe: https://openai.com/blog/universe/. The machine learns to play video games as human players do: what the machine observes is pixels, and the machine learns to take proper actions by itself.
Example: Playing Video Game - Space Invaders. The score is the reward. Kill the aliens (your ship can fire and hide behind shields). Termination: all the aliens are killed, or your spaceship is destroyed.
Example: Playing Video Game - Space Invaders. Play yourself: http://www.2600online.com/spaceinvaders.html How about the machine: https://gym.openai.com/evaluations/eval_eduozx4hryqgtcvk9ltw
Example: Playing Video Game. Start with observation s_1; take action a_1 (right) and obtain reward r_1 = 0; see observation s_2; take action a_2 (fire) and obtain reward r_2 = 5 (killed an alien); see observation s_3; ... Usually there is some randomness in the environment.
Example: Playing Video Game. The process continues: observation s_1, s_2, s_3, ... After many turns, the machine takes action a_T, obtains reward r_T, and the game is over (the spaceship is destroyed). The whole sequence is an episode. Learn to maximize the expected cumulative reward per episode.
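To make the episode structure concrete, here is a minimal sketch of one episode through the Gym interface mentioned above (the environment name, the random placeholder policy, and the pre-0.26 Gym step API are my assumptions, not part of the slides):

```python
import gym

# A minimal sketch: play one episode with a random placeholder policy.
# Assumes the classic (pre-0.26) Gym step API and an Atari install.
env = gym.make("SpaceInvaders-v0")
s = env.reset()                        # start with observation s_1
total_reward, done = 0.0, False
while not done:                        # ... s_t, a_t, r_t, s_{t+1} ...
    a = env.action_space.sample()      # stand-in for the actor pi(s_t)
    s, r, done, info = env.step(a)     # environment returns r_t and s_{t+1}
    total_reward += r                  # cumulative reward of the episode
print("cumulative reward this episode:", total_reward)
```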
Difficulties of Reinforcement Learning. Reward delay: in Space Invaders, only fire obtains reward, although the moving before firing is important; in Go, it may be better to sacrifice immediate reward to gain more long-term reward. The agent's actions affect the subsequent data it receives: e.g., it must explore actions it has not tried (exploration).
Outline: Policy-based (learning an actor), Value-based (learning a critic), and Actor + Critic. AlphaGo is policy-based + value-based + model-based. Asynchronous Advantage Actor-Critic (A3C): Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, Asynchronous Methods for Deep Reinforcement Learning, ICML, 2016.
To learn deep reinforcement learning. Textbook: Reinforcement Learning: An Introduction, https://webdocs.cs.ualberta.ca/~sutton/book/thebook.html. Lectures of David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html (10 lectures, 1:30 each); http://videolectures.net/rldm2015_silver_reinforcement_learning/ (deep reinforcement learning). Lectures of John Schulman: https://youtu.be/aurx-rp_ss4
Policy-based Approach: Learning an Actor
Machine learning is looking for a function. Here the observation is the function input and the action is the function output: Action = π(Observation). The function is the actor/policy; the reward from the environment is used to pick the best function.
Three Steps for Deep Learning. Step 1: define a set of functions (a neural network as actor). Step 2: decide the goodness of a function. Step 3: pick the best function. Deep learning is so simple.
Neural network as Actor. Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g., the pixels). Output of the neural network: each action corresponds to a neuron in the output layer, giving the probability of taking that action (e.g., left 0.7, right 0.2, fire 0.1). What is the benefit of using a network instead of a lookup table? Generalization.
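As a sketch of such a network actor (PyTorch and the layer sizes are my assumptions; the three outputs mirror the left/right/fire example above):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Neural network as actor: pixels in, one probability per action out."""
    def __init__(self, n_pixels=84 * 84, n_actions=3):   # left / right / fire
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pixels, 128),   # hidden size is an assumption
            nn.ReLU(),
            nn.Linear(128, n_actions),
            nn.Softmax(dim=-1),         # e.g. [0.7, 0.2, 0.1]
        )

    def forward(self, s):               # s: flattened observation vector
        return self.net(s)              # probability of taking each action
```

Unlike a lookup table, which needs one entry per possible screen, the network can generalize to observations it has never seen.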
Three Steps for Deep Learning. Step 1: define a set of functions (a neural network as actor). Step 2: decide the goodness of a function. Step 3: pick the best function. Deep learning is so simple.
Goodness of Actor. Review, supervised learning: given a set of parameters θ, the network maps an input (x_1, x_2, ..., x_256) through a softmax to outputs y_1, ..., y_10, which should be as close as possible to the one-hot target, giving a per-example loss l_n. Total loss: L = Σ_{n=1}^N l_n. Find the network parameters θ* that minimize the total loss L.
Goodness of Actor. Given an actor π_θ(s) with network parameter θ, use the actor π_θ(s) to play the video game: start with observation s_1, the machine decides to take a_1 and obtains reward r_1, sees observation s_2, decides to take a_2 and obtains reward r_2, sees observation s_3, ..., decides to take a_T and obtains reward r_T, END. Total reward: R_θ = Σ_{t=1}^T r_t. Because of randomness in the actor and the game, R_θ is different each time, even with the same actor. We define R̄_θ as the expected value of R_θ; R̄_θ evaluates the goodness of the actor π_θ(s).
Goodness of Actor. An episode is considered as a trajectory τ = {s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T} with R(τ) = Σ_{t=1}^T r_t. If you use an actor to play the game, each τ has a probability of being sampled, and the probability depends on the actor parameter θ: P(τ|θ). Therefore R̄_θ = Σ_τ R(τ) P(τ|θ), summing over all possible trajectories. In practice, use π_θ to play the game N times, obtaining τ^1, τ^2, ..., τ^N (sampling τ from P(τ|θ) N times): R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n).
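The approximation in the last step is just an empirical mean over played episodes; as a one-function sketch (`play_one_episode` is a hypothetical helper that lets the actor play once and returns R(τ)):

```python
def estimate_expected_reward(actor, play_one_episode, N=100):
    """Approximate R̄_θ = Σ_τ R(τ) P(τ|θ) by (1/N) Σ_n R(τ^n),
    sampling N trajectories by letting the actor play the game."""
    returns = [play_one_episode(actor) for _ in range(N)]
    return sum(returns) / N
```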
Three Steps for Deep Learning. Step 1: define a set of functions (a neural network as actor). Step 2: decide the goodness of a function. Step 3: pick the best function. Deep learning is so simple.
Gradient Ascent. Problem statement: θ* = arg max_θ R̄_θ, where R̄_θ = Σ_τ R(τ) P(τ|θ). Gradient ascent: start with θ^0; θ^1 ← θ^0 + η ∇R̄_{θ^0}; θ^2 ← θ^1 + η ∇R̄_{θ^1}; ... Here θ = {w_1, w_2, ..., b_1, ...} and ∇R̄_θ = [∂R̄_θ/∂w_1, ∂R̄_θ/∂w_2, ..., ∂R̄_θ/∂b_1, ...]^T.
Gradient Ascent. ∇R̄_θ = ? Since R̄_θ = Σ_τ R(τ) P(τ|θ):
∇R̄_θ = Σ_τ R(τ) ∇P(τ|θ) = Σ_τ R(τ) P(τ|θ) ∇P(τ|θ) / P(τ|θ) = Σ_τ R(τ) P(τ|θ) ∇log P(τ|θ) ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ),
using d log f(x)/dx = (1/f(x)) df(x)/dx, and using π_θ to play the game N times to obtain τ^1, τ^2, ..., τ^N. Note that R(τ) does not have to be differentiable; it can even be a black box.
Gradient Ascent logp τ θ =? τ = s 1, a 1, r 1, s 2, a 2, r 2,, s T, a T, r T P τ θ = p s 1 p a 1 s 1, θ p r 1, s 2 s 1, a 1 p a 2 s 2, θ p r 2, s 3 s 2, a 2 T = p s 1 t=1 p a t s t, θ p r t, s t+1 s t, a t not related to your actor Control by your actor π θ p a t = "fire" s t, θ = 0.7 left 0.1 Actor s right t π θ 0.2 fire 0.7
Gradient Ascent logp τ θ =? τ = s 1, a 1, r 1, s 2, a 2, r 2,, s T, a T, r T P τ θ = p s 1 logp τ θ T = logp s 1 + t=1 T T t=1 logp τ θ = t=1 p a t s t, θ p r t, s t+1 s t, a t logp a t s t, θ + logp r t, s t+1 s t, a t logp a t s t, θ Ignore the terms not related to θ
Gradient Ascent. θ^new ← θ^old + η ∇R̄_{θ^old}, where
∇R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ) = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ).
If, in τ^n, the machine takes a_t^n when seeing s_t^n: when R(τ^n) is positive, tune θ to increase p(a_t^n|s_t^n); when R(τ^n) is negative, tune θ to decrease p(a_t^n|s_t^n). It is very important to weight by the cumulative reward R(τ^n) of the whole trajectory τ^n instead of the immediate reward r_t^n: if we replaced R(τ^n) with r_t^n, only the actions with immediate reward (e.g., fire) would ever be reinforced.
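In code, this update is usually done by ascending the surrogate objective (1/N) Σ_n Σ_t R(τ^n) log p(a_t^n|s_t^n, θ); a PyTorch sketch, where the per-episode data layout is my assumption:

```python
import torch

def policy_gradient_step(actor, optimizer, episodes):
    """episodes: list of (states, actions, R) tensors per trajectory, where
    R = R(tau^n) is the cumulative reward of the WHOLE trajectory,
    deliberately not the immediate r_t^n."""
    loss = torch.tensor(0.0)
    for states, actions, R in episodes:
        probs = actor(states)                                  # (T, n_actions)
        logp = torch.log(probs.gather(1, actions.view(-1, 1))).sum()
        loss = loss - R * logp / len(episodes)   # minus sign: ascend R̄_θ
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```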
Gradient Ascent. θ^new ← θ^old + η ∇R̄_{θ^old}, with ∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ). Why take the log, i.e., why ∇log p(a_t^n|s_t^n, θ) = ∇p(a_t^n|s_t^n, θ) / p(a_t^n|s_t^n, θ), divided by p(a_t^n|s_t^n, θ)? E.g., in the sampled data, s has been seen in τ^13, τ^15, τ^17, τ^33: in τ^13 the machine takes action a and R(τ^13) = 2; in τ^15, τ^17, τ^33 it takes action b and R(τ^15) = R(τ^17) = R(τ^33) = 1. Without dividing by the probability, action b would be preferred simply because it is sampled more often, even though a yields the larger reward; the division normalizes away how often each action happens to be sampled.
Add a Baseline. It is possible that R(τ^n) is always positive, so the update θ^new ← θ^old + η ∇R̄_{θ^old} raises the probability of every sampled action. In the ideal case (all actions a, b, c updated) this still works, since the probabilities are normalized; but with sampling, the probability of the actions that were not sampled will decrease even if they are good. Fix: subtract a baseline b,
∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} (R(τ^n) − b) ∇log p(a_t^n|s_t^n, θ).
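In the sketch above this is a one-line change: weight each log-probability by (R(τ^n) − b) instead of R(τ^n). The choice of b as the average return is a common convention assumed here, not specified by the slides:

```python
def baseline_adjusted_returns(episodes):
    """Replace each R(tau^n) with R(tau^n) - b, where b is the average
    return over the sampled trajectories, so below-average trajectories
    push their actions' probabilities down even if all rewards are positive."""
    returns = [R for _, _, R in episodes]
    b = sum(returns) / len(returns)
    return [(states, actions, R - b) for states, actions, R in episodes]
```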
Value-based Approach: Learning a Critic
Critic. A critic does not determine the action. Given an actor, it evaluates how good the actor is. An actor can be found from a critic, e.g., Q-learning (not today). (Image source: http://combiboilersleeds.com/picaso/critics/critics-4.html)
Three kinds of Critics. A critic is a function that depends on the actor π it evaluates; the function is represented by a neural network. State value function V^π(s): when using actor π, the cumulated reward expected to be obtained after seeing observation (state) s. The network takes s as input and outputs a scalar V^π(s): V^π(s) is large for a promising state and smaller for a bad one.
Three kinds of Critics. State-action value function Q^π(s, a): when using actor π, the cumulated reward expected to be obtained after seeing observation s and taking action a. The network can take the pair (s, a) as input and output the scalar Q^π(s, a); or it can take only s as input and output Q^π(s, a = left), Q^π(s, a = right), and Q^π(s, a = fire) at once (for discrete actions only).
How to estimate V^π(s): Monte-Carlo based approach. The critic watches π playing the game. After seeing s_a, the cumulated reward until the end of the episode is G_a, so the network is trained so that V^π(s_a) ≈ G_a; after seeing s_b, the cumulated reward until the end of the episode is G_b, so V^π(s_b) ≈ G_b.
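Training a Monte-Carlo critic is plain regression: the target for V^π(s) is the observed return G. A sketch (assuming batches of states and their observed returns-to-episode-end, collected by watching π play):

```python
import torch
import torch.nn.functional as F

def mc_critic_step(critic, optimizer, states, returns):
    """Monte-Carlo critic update: regress V^pi(s) toward the cumulated
    reward G observed from s until the end of the episode."""
    v = critic(states).squeeze(-1)         # predicted V^pi(s), shape (B,)
    loss = F.mse_loss(v, returns)          # push V^pi(s_a) toward G_a
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```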
How to estimate V^π(s): Temporal-difference approach. Given a single transition ..., s_a, a, r, s_b, ..., we know V^π(s_a) = r + V^π(s_b), so the network is trained so that V^π(s_a) − V^π(s_b) ≈ r. Some applications have very long episodes, so delaying all learning until an episode's end is too slow.
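Because it needs only a single transition (s_a, a, r, s_b), the TD update can run long before the episode ends; a sketch under the same assumptions (undiscounted, as in the slide):

```python
import torch
import torch.nn.functional as F

def td_critic_step(critic, optimizer, s_a, r, s_b):
    """TD critic update on one transition: push V^pi(s_a) toward
    r + V^pi(s_b), bootstrapping from the current estimate of V^pi(s_b)."""
    target = (r + critic(s_b)).detach()    # treat the target as a constant
    loss = F.mse_loss(critic(s_a), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```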
How to estimate V^π(s) [Sutton, v2, Example 6.4]. The critic watches the following 8 episodes (the actions are ignored here): s_a, r = 0, s_b, r = 0, END; then s_b, r = 1, END six times; then s_b, r = 0, END. So V^π(s_b) = 6/8 = 3/4. What is V^π(s_a), 0 or 3/4? Monte-Carlo: V^π(s_a) = 0, because the only episode through s_a obtained total reward 0. Temporal-difference: V^π(s_a) = r + V^π(s_b) = 0 + 3/4 = 3/4. The two answers differ because Monte-Carlo trusts only the observed return, while TD bootstraps from the estimate of V^π(s_b).
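A tiny script reproduces the arithmetic (the episode encoding is mine):

```python
# Sutton's Example 6.4: eight episodes, encoded as (states, rewards).
episodes = [(["sa", "sb"], [0, 0])] + [(["sb"], [1])] * 6 + [(["sb"], [0])]

# Monte-Carlo: V(s) = average total return observed after visiting s.
G = {"sa": [], "sb": []}
for states, rewards in episodes:
    for i, s in enumerate(states):
        G[s].append(sum(rewards[i:]))
print({s: sum(g) / len(g) for s, g in G.items()})  # {'sa': 0.0, 'sb': 0.75}

# Temporal-difference: V(sa) = r + V(sb) = 0 + 3/4.
print(0 + sum(G["sb"]) / len(G["sb"]))             # 0.75
```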
Deep Reinforcement Learning: Actor-Critic
Actor-Critic. In the policy-gradient update θ^new ← θ^old + η ∇R̄_{θ^old} with ∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} (...) ∇log p(a_t^n|s_t^n, θ), replace R(τ^n) with the advantage function evaluated by the critic: r_t^n + V^{π_θ}(s_{t+1}^n) − V^{π_θ}(s_t^n). It compares the reward r_t^n we truly obtain when taking action a_t^n against the one-step reward expected if we simply follow actor π_θ, so a baseline is built in. A positive advantage function increases the probability of action a_t^n; a negative one decreases it.
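Given a trained critic, the advantage is one line; this sketch plugs into the policy-gradient step above by weighting each ∇log p(a_t^n|s_t^n, θ) with A_t instead of R(τ^n) (undiscounted, as in the slide):

```python
def advantage(critic, s_t, r_t, s_t_next):
    """A_t = r_t + V^pi(s_{t+1}) - V^pi(s_t): positive when the reward truly
    obtained beats what the critic expected, negative otherwise.
    detach(): the advantage only weights the policy gradient."""
    return (r_t + critic(s_t_next) - critic(s_t)).detach()
```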
Actor-Critic Tips. Tip 1: the parameters of the actor π(s) and of the critic V^π(s) can be shared: one network processes s; one output head gives the action probabilities (left, right, fire), the other gives V^π(s). Tip 2: use the output entropy of π(s) as regularization; larger entropy is preferred, which encourages exploration. (Both tips are sketched below.)
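Both tips fit in one small module: a shared trunk with a policy head and a value head, plus an entropy term to add to the objective (the layer sizes and the entropy weight β are my assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Tip 1: actor and critic share parameters. One trunk processes s;
    one head outputs action probabilities, the other outputs V^pi(s)."""
    def __init__(self, n_inputs=84 * 84, n_actions=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, n_actions)   # left / right / fire
        self.value_head = nn.Linear(128, 1)            # V^pi(s)

    def forward(self, s):
        h = self.trunk(s)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

def entropy(probs, eps=1e-8):
    """Tip 2: add beta * entropy to the objective (beta is a hyperparameter,
    an assumption here); larger entropy means a less peaked policy, i.e.,
    more exploration."""
    return -(probs * torch.log(probs + eps)).sum(dim=-1).mean()
```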
Asynchronous. Each A3C worker repeatedly: 1. copies the global parameters (θ^1 ← θ); 2. samples some data; 3. computes gradients ∇θ; 4. updates the global model (θ ← θ + η ∇θ), while the other workers also update it in parallel. Source of image: https://medium.com/emergentfuture/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9
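A bare-bones sketch of the asynchronous loop, with dummy stand-ins for "sample some data" and "compute gradients" (all helper logic here is illustrative, not A3C's actual losses):

```python
import random
import threading

def sample_data(theta):                    # 2. dummy stand-in for playing
    return [random.random() for _ in range(8)]

def compute_grads(theta, data):            # 3. dummy stand-in for backprop
    return {k: sum(data) / len(data) for k in theta}

def worker(global_theta, lock, eta=0.01, steps=100):
    for _ in range(steps):
        theta1 = dict(global_theta)        # 1. copy the global parameters
        grads = compute_grads(theta1, sample_data(theta1))
        with lock:                         # 4. update the GLOBAL model
            for k in global_theta:         #    (other workers also update,
                global_theta[k] += eta * grads[k]  # possibly with stale grads)

theta = {"w": 0.0}
lock = threading.Lock()
workers = [threading.Thread(target=worker, args=(theta, lock)) for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
```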
Demo of A3C DeepMind https://www.youtube.com/watch?v=nmr5mjcfzcw
Demo of A3C DeepMind https://www.youtube.com/watch?v=0xo1ldx3l5q
Conclusion of This Semester
Learning Map. Supervised Learning: Regression, Classification, Structured Learning; with linear models and non-linear models (deep learning; SVM, decision tree, K-NN). Semi-supervised Learning. Unsupervised Learning. Transfer Learning. Reinforcement Learning.
Acknowledgment: Thanks to classmate Larry Guo for spotting the notation errors in the slides.