Actor-Critic. Hung-yi Lee

1 Actor-Critic Hung-yi Lee

2 Asynchronous Advantage Actor-Critic (A3C)
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning", ICML, 2016

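3 Review: Policy Gradient
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n - b \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$$
$G_t^n = \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n$ is obtained via interaction; $b$ is the baseline.
$G$ is very unstable: from the same state $s$ and action $a$, sampled values can be as different as G = 100, G = 3, G = 1, G = 2, G = 10.
With sufficient samples, the average approximates the expectation of G. Can we estimate the expected value of G?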
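
The weight that multiplies each $\nabla \log p_\theta(a_t^n \mid s_t^n)$ term can be computed from one sampled episode. The following is a minimal sketch in plain Python; the function names, the reward list, and the constant baseline are illustrative assumptions, not part of the slides.

```python
def discounted_returns(rewards, gamma):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_t', for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def policy_gradient_weights(rewards, gamma, baseline):
    """(G_t - b): the factor in front of grad log p(a_t | s_t) at each step."""
    return [g - baseline for g in discounted_returns(rewards, gamma)]


# Hypothetical three-step episode.
returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
weights = policy_gradient_weights([1.0, 0.0, 2.0], gamma=0.5, baseline=1.0)
print(returns, weights)
```

Because every weight is built from a single sampled trajectory, repeated runs from the same state can give very different values, which is the instability noted on the slide.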

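4 Review: Q-Learning
State value function $V^\pi(s)$: when using actor $\pi$, the cumulated reward expected to be obtained after visiting state $s$. The network takes $s$ and outputs a scalar $V^\pi(s)$.
State-action value function $Q^\pi(s, a)$: when using actor $\pi$, the cumulated reward expected to be obtained after taking $a$ at state $s$. The network takes $s$ and outputs $Q^\pi(s, a = \text{left})$, $Q^\pi(s, a = \text{right})$, $Q^\pi(s, a = \text{fire})$ (for discrete actions only).
Both are estimated by TD or MC.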
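
To make "estimated by TD or MC" concrete, here is a hedged tabular sketch of the two update rules for $V^\pi$. The dictionary-based value table, the learning rate `lr`, and the state names are assumptions for illustration; the slides use neural networks rather than tables.

```python
def mc_update(value, state, observed_return, lr):
    """Monte Carlo: move V(s) toward the return G actually observed after s."""
    value[state] += lr * (observed_return - value[state])


def td_update(value, state, reward, next_state, lr, gamma=1.0):
    """Temporal difference: move V(s) toward r + gamma * V(s')."""
    target = reward + gamma * value[next_state]
    value[state] += lr * (target - value[state])


# Hypothetical two-state table.
value = {"s1": 0.0, "s2": 1.0}
mc_update(value, "s1", observed_return=3.0, lr=0.5)
after_mc = value["s1"]
td_update(value, "s1", reward=1.0, next_state="s2", lr=0.5)
after_td = value["s1"]
print(after_mc, after_td)
```

MC needs the whole episode to finish before it can update, while TD updates after every step by bootstrapping from the current estimate of the next state.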

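5 Actor-Critic
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n - b \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$$
$G_t^n = \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n$ is obtained via interaction, and $E[G_t^n] = Q^{\pi_\theta}(s_t^n, a_t^n)$.
Replace $G_t^n$ with $Q^{\pi_\theta}(s_t^n, a_t^n)$ and the baseline $b$ with $V^{\pi_\theta}(s_t^n)$, so the weight becomes $Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)$.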

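6 Advantage Actor-Critic
The weight is $Q^\pi(s_t^n, a_t^n) - V^\pi(s_t^n)$. Estimate two networks? We can estimate only one.
$$Q^\pi(s_t^n, a_t^n) = E\left[ r_t^n + V^\pi(s_{t+1}^n) \right]$$
Dropping the expectation:
$$Q^\pi(s_t^n, a_t^n) = r_t^n + V^\pi(s_{t+1}^n)$$
The weight becomes $r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n)$: only the state value is estimated, at the cost of a little variance.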
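
The advantage $r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n)$ can be sketched for one episode as follows. The function name and the sample numbers are assumptions; `gamma` defaults to 1.0 because the slide writes the advantage without a discount factor.

```python
def td_advantages(rewards, values, gamma=1.0):
    """Advantage r_t + gamma * V(s_{t+1}) - V(s_t) for each step of one episode.

    `values` holds one entry per visited state, so it has len(rewards) + 1
    entries; the last one is the value of the final state (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Hypothetical critic outputs for a two-step episode that ends in a terminal state.
advantages = td_advantages(rewards=[1.0, 0.0], values=[0.5, 1.0, 0.0])
print(advantages)
```

Only one network (the critic for $V^\pi$) is needed to produce these numbers, which is the point of the substitution on this slide.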

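7 Advantage Actor-Critic
Loop: $\pi$ interacts with the environment → learn $V^\pi(s)$ by TD or MC → update the actor from $\pi$ to $\pi'$ based on $V^\pi(s)$ → set $\pi = \pi'$ and repeat.
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( r_t^n + V^\pi(s_{t+1}^n) - V^\pi(s_t^n) \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$$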

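8 Advantage Actor-Critic: Tips
The parameters of actor $\pi(s)$ and critic $V^\pi(s)$ can be shared: $s$ goes through a shared network, then one network outputs the action distribution (left, right, fire) and another outputs $V^\pi(s)$.
Use output entropy as regularization for $\pi(s)$: larger entropy is preferred, which encourages exploration.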
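
The entropy regularizer can be illustrated with a short sketch. The softmax helper and the two sets of logits are assumptions for illustration; in practice the entropy term is added to the actor's objective with a small coefficient.

```python
import math


def softmax(logits):
    """Turn actor logits into action probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy -sum p log p of the action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits over the three actions left, right, fire.
uniform_policy = softmax([0.0, 0.0, 0.0])
peaked_policy = softmax([5.0, 0.0, 0.0])
print(entropy(uniform_policy), entropy(peaked_policy))
```

The uniform policy has the largest possible entropy for three actions, $\log 3$, while the peaked policy scores much lower, so rewarding entropy pushes the actor away from committing to one action too early.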

9 Asynchronous Advantage Actor-Critic (A3C)
The idea is from 李思叡

Asynchronous Source of image: https://medium.

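10 Asynchronous
Source of image:
1. Copy global parameters
2. Sampling some data
3. Compute gradients
4. Update global models
A worker copies the global parameters $\theta^1$ and computes $\Delta\theta$ from its own data. By the time it sends the result back, the global parameters may already be $\theta^2$ (other workers also update models), so the update applied is $\theta^2 + \eta \Delta\theta$.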
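
The four steps can be simulated sequentially to show the effect of a stale copy. This is a toy sketch, not A3C itself: the one-dimensional objective $-(\theta - 3)^2$, the event schedule, and all names are assumptions, and real workers run in parallel threads or processes.

```python
def toy_gradient(theta):
    """Gradient of the toy objective -(theta - 3)^2, standing in for a worker's gradient."""
    return -2.0 * (theta - 3.0)


def run_workers(global_theta, lr, schedule):
    """Replay a list of ("copy", worker) and ("push", worker) events in order."""
    local_copy = {}
    for event, worker in schedule:
        if event == "copy":
            # Step 1: copy the current global parameters.
            local_copy[worker] = global_theta
        elif event == "push":
            # Steps 2-3: the gradient is computed on the worker's (possibly stale) copy.
            delta = toy_gradient(local_copy[worker])
            # Step 4: the update is applied to whatever the global parameters are now.
            global_theta = global_theta + lr * delta
        else:
            raise ValueError(f"unknown event: {event}")
    return global_theta


# Both workers copy theta = 0.0; B pushes first, so A's gradient is stale when it arrives.
schedule = [("copy", "A"), ("copy", "B"), ("push", "B"), ("push", "A")]
final_theta = run_workers(global_theta=0.0, lr=0.1, schedule=schedule)
print(final_theta)
```

Worker A's gradient was computed at the old parameters but is added to the parameters that worker B already changed, which mirrors the $\theta^2 + \eta \Delta\theta$ update on the slide.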

11 Pathwise Derivative Policy Gradient
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, "Deterministic Policy Gradient Algorithms", ICML, 2014
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra, "Continuous Control with Deep Reinforcement Learning", ICLR, 2016

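12 Another Way to Use the Critic
Original actor-critic: the critic's $Q^\pi(s, a)$ only says whether a sampled action should be decreased or increased (figure: actions 1 and 2, labeled decrease and increase).
Pathwise derivative policy gradient: from the Q function we know that taking $a'$ at state $s$ is better than $a$. We know the parameters of the Q function.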

13 Pathwise derivative policy gradient vs. original actor-critic
The action $a$ is a continuous vector:
$$a = \arg\max_a Q(s, a)$$
Actor as the solver of this optimization problem: $s$ → Actor $\pi$ → $a$.

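14 Pathwise Derivative Policy Gradient
$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$
where $a$ is the output of an actor.
Gradient ascent:
$$\theta^{\pi'} \leftarrow \theta^{\pi} + \eta \nabla_{\theta^{\pi}} Q^\pi(s, a)$$
$s$ → Actor $\pi$ → $a$ → $Q^\pi$ → $Q^\pi(s, a)$. This is one large network: $Q^\pi$ is fixed, and only the actor is updated, from $\pi$ to $\pi'$.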
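
The gradient ascent step can be worked through on a toy problem. This sketch assumes a fixed critic $Q(s, a) = -(a - 2s)^2$ and a one-parameter linear actor $a = \theta s$; both are made up for illustration, and the derivative is written by hand instead of using automatic differentiation.

```python
def actor(theta, state):
    """Hypothetical linear actor: a = theta * s."""
    return theta * state


def dq_da(state, action):
    """Derivative of the fixed toy critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (action - 2.0 * state)


def actor_step(theta, state, lr):
    """One gradient ascent step on Q(s, actor(s)); the critic stays fixed."""
    action = actor(theta, state)
    # Chain rule through the "large network": dQ/dtheta = dQ/da * da/dtheta, with da/dtheta = s.
    return theta + lr * dq_da(state, action) * state


theta = 0.0
for _ in range(200):
    for state in (0.5, 1.0, 1.5):
        theta = actor_step(theta, state, lr=0.1)
print(theta)
```

For this critic the best action is $a = 2s$, so the actor parameter should approach 2: the actor ends up solving the arg max problem without ever enumerating actions.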

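15 Loop: $\pi$ interacts with the environment (with exploration) and the data goes into a replay buffer → learn $Q^\pi(s, a)$ by TD or MC → find a new actor $\pi'$ "better" than $\pi$ → set $\pi = \pi'$ and repeat.
Actor update with $Q^\pi$ fixed: $s$ → Actor $\pi$ → $a$ → $Q^\pi$ → $Q^\pi(s, a)$
$$\theta^{\pi'} \leftarrow \theta^{\pi} + \eta \nabla_{\theta^{\pi}} Q^\pi(s, a)$$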

16 Q-Learning Algorithm
Initialize Q-function $Q$, target Q-function $\hat{Q} = Q$
In each episode
  For each time step $t$
    Given state $s_t$, take action $a_t$ based on $Q$ (exploration)
    Obtain reward $r_t$, and reach new state $s_{t+1}$
    Store $(s_t, a_t, r_t, s_{t+1})$ into buffer
    Sample $(s_i, a_i, r_i, s_{i+1})$ from buffer (usually a batch)
    Target $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$
    Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$ (regression)
    Every $C$ steps reset $\hat{Q} = Q$
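
The "store into buffer" and "sample from buffer" steps can be sketched with the standard library. The class name, the fixed capacity, and the placeholder transitions are assumptions; the slides do not specify how the buffer is implemented.

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps the most recent transitions and hands out random batches."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self.items = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.items.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Sample without replacement; never ask for more than is stored.
        count = min(batch_size, len(self.items))
        return rng.sample(list(self.items), count)


# Hypothetical transitions: states are integers, the action is a placeholder.
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(t, "left", 1.0, t + 1)
batch = buffer.sample(2)
print(len(buffer.items), batch)
```

Sampling at random from stored transitions means each batch mixes data collected at different times, rather than only the most recent step.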

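17 Q-Learning Algorithm → Pathwise Derivative Policy Gradient
Four changes (marked 1 to 4) turn the Q-learning algorithm into pathwise derivative policy gradient.
Initialize Q-function $Q$, target Q-function $\hat{Q} = Q$, actor $\pi$, target actor $\hat{\pi} = \pi$
In each episode
  For each time step $t$
    (1) Given state $s_t$, take action $a_t$ based on $\pi$ instead of $Q$ (exploration)
    Obtain reward $r_t$, and reach new state $s_{t+1}$
    Store $(s_t, a_t, r_t, s_{t+1})$ into buffer
    Sample $(s_i, a_i, r_i, s_{i+1})$ from buffer (usually a batch)
    (2) Target $y = r_i + \hat{Q}(s_{i+1}, \hat{\pi}(s_{i+1}))$ instead of $y = r_i + \max_a \hat{Q}(s_{i+1}, a)$
    Update the parameters of $Q$ to make $Q(s_i, a_i)$ close to $y$ (regression)
    (3) Update the parameters of $\pi$ to maximize $Q(s_i, \pi(s_i))$
    Every $C$ steps reset $\hat{Q} = Q$
    (4) Every $C$ steps reset $\hat{\pi} = \pi$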
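
Change 2 is the one that removes the max over actions. A small sketch of both targets side by side follows; the toy target critic, the toy target actor, and the three-action grid are assumptions for illustration, and no discount factor is used because the slide's target has none.

```python
def q_learning_target(reward, next_state, target_q, actions):
    """y = r + max over a of Q_hat(s', a); needs a finite set of actions to enumerate."""
    return reward + max(target_q(next_state, a) for a in actions)


def pdpg_target(reward, next_state, target_q, target_actor):
    """y = r + Q_hat(s', pi_hat(s')); the target actor proposes the action instead."""
    return reward + target_q(next_state, target_actor(next_state))


def toy_target_q(state, action):
    """Hypothetical target critic: best action equals the state."""
    return -((action - state) ** 2)


def toy_target_actor(state):
    """Hypothetical target actor that is not yet optimal for the critic above."""
    return 0.5 * state


y_q = q_learning_target(1.0, 1.0, toy_target_q, actions=(-1.0, 0.0, 1.0))
y_pdpg = pdpg_target(1.0, 1.0, toy_target_q, toy_target_actor)
print(y_q, y_pdpg)
```

The pathwise version never enumerates actions, so it also works when the action is a continuous vector; its target is only as good as the target actor's current proposal.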

18 Connection with GAN
David Pfau, Oriol Vinyals, "Connecting Generative Adversarial Networks and Actor-Critic Methods", arXiv preprint, 2016
