Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation) Example (k-nn for hill-car)

Size: px

Start display at page:

Download "Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation) Example (k-nn for hill-car)"

Abraham Reeves
6 years ago
Views:

1 Function Approximation in Reinforcement Learning Gordon Geo November 5, 999

2 Overview Example (TD-Gammon) Admission Why approximate RL is hard TD() Fitted value iteration (collocation) Example (k-nn for hill-car)

3 Backgammon Object: move all pieces o rst Opponent can block you or send you backwards

4 Backgammon cont'd 30 pieces (5 each for you and opponent) 24 points + bar + o = 26 locations 2 dice (36 possible rolls) Rough (over)estimate: : states Branching factor: 36 about 20 Complications: gammon, backgammon, doubling cube

5 TD-Gammon [Tesauro] Neural network with hidden layer 200 inputs: board position hidden units output: win probability Sigmoids on all hidden, output units (range [0 ])

6 >< >: win loss 0 Using RL Win probability is value function if: 8 Reinforcement = 0 game not over Discount is = weights for network should approximately satisfy Bellman So equation Warning: devil is in details! For now: ignore details, use TD(0)

7 Write v(xjw) for output of net with inputs x and weights w Given a transition x a r y Update w := w old + (r + v(yjw old ) ; v(xjw old )) r w v(xjw)j wold Review of TD(0) Ignores a (assumes it was chosen greedily)

8 TD(0) is not gradient descent Max isn't dierentiable Haven't said how to choose transition (won't be independent) Step direction is not a full gradient

9 Write Error(x r yjw) =(r + v(yjw) ; v(xjw)) 2 Write Error(x r yjw w 2 )=(r + v(yjw 2 ) ; v(xjw )) 2 TD uses r w Error(x r yjw w old )j wold GD would use r w Error(x r yjw)j wold Bellman error

10 Raw: Are there at least k of my pieces at location x? Representation Raw board or raw + expert features Are there at least k opposing pieces at location x? Is it my turn? My opponent's? For k>k max, extra pieces added to last unit's activation k max is 4 for points, for bar and o All inputs scaled to lie approximately in [0 ]

11 A useful trick position+die roll position position+die roll Die rolls aren't represented Implicit in lookahead choose random

12 Training Start from zero knowledge Self play (up to.5 million games) Data: state, die roll, action, reinforcement, state,... Compress to: state, state,..., state, win/loss

13 Results features give pretty good player (close to best computer Raw at time) player Raw+expert features give world-class player Absolute probabilities not so hot (0% error) Relative probabilities very good

14 What's easy about RL in backgammon Random strong eective discount kick out of local minima Guaranteed to terminate Perfect model, no hidden state Self-play allows huge training set

15 function approximator Nonlinear usual worry of local minima data distribution Changing changing policy causes change in importance of states What's hard about RL in backgammon Game (minimax instead of min or max) complete divergence is also possible [Van Roy] could get stuck (checkers, go) could oscillate forever wrong distribution could cause divergence [Gordon, Baird]

16 Huge variety of approximate RL algorithms Incremental vs. batch Model-based vs. direct Kind of function approximation Control vs. estimation State-values vs. action-values (V vs. Q) Vs. no values (policy only)

17 One fundamental question What does \approximate solution to Bellman equations" mean? Subquestion: how do we nd one?

18 Possible answers Bellman equations are a red herring Minimum squared Bellman error Minimum weighted squared Bellman error Relaxation of Bellman equations satisfy at some states other

19 Fit a line to each row: v(i j)=a j + b j i Minimum SBE example 3 00 grid of states

20 Minimum SBE example cont'd Exact Min SBE Min weighted SBE

21 action a Fixed policy Measures how important (x a) is to value of Flows F (x a) = expected # of times we visit state x and perform Fixed starting frequencies f 0 (x)

22 Visit (s a) at steps 2 and 7, discounted ow is Flows cont'd If <, use discounted ows For all, F satises conservation equation 0 X A F (x a) f0 (x)+ X y a F (y a)p (xjy a) a

23 F = F Aside: Can nd optimal ows using nonnegativity, conservation, and of reinforcement maximization This is dual (in sense of LP) to solving Bellman equations Optimal ows are optimal ows Optimal ows determine optimal policy maximizes total expected discounted reinforcement F X E(r(x a))f (x a) x a

24 Choosing weights Optimal ows would be good weights If just one policy, can nd ows Otherwise, chicken and egg problem Iterative improvement? Yes, but none known to converge. Research question: EM algorithm?

25 ij is probability of going to state j from state i, r i is expected p reinforcement state is i Matrix form of TD(0) Assume one policy, n states Write P for transition probability matrix (n n) Write r for reinforcement vector (n ) P i (ith row of P ) is distribution of next state given that current

26 Write E = P ; I Matrix form of TD(0) cont'd Write v for value function (n ) v i is value of state i (Pv + r) ; v is vector of Bellman errors Bellman equation is (P ; I)v + r =0

27 Assume w = 0 corresponds to v(x) = 0 for all x Matrix form of TD(0) cont'd Assume v is linear in parameters w (k ) r w v i is a constant k-vector (call it A i ) r w v is a constant n k matrix (call it A)

28 = X x = X x = X x = X x P (x)a x (r x + w X y Expected update E((r + v(yjw) ; v(xjw))r w v(xjw)) P (x)e((r + v(yjw) ; v(xjw))a x jx) P (x)a x (E(rjx)+E(v(yjw x)) ; v(xjw)) P (x)a x (r x + E(A y wjx) ; A x w) P (yjx)a y ; A x w)

29 P (x)a x (r x + w X y = X x = X x = X x Expected update cont'd X P (yjx)a y ; A x w) x P (x)a x (r x + w (P T x A) ; A x w) P (x)a x (r x +(P T x A ; A x) w) P (x)a x (r x +(P x ; U x ) T A w) = (DA) T (R +(P ; I)Aw)

30 Data x r x 2 r 2 ::: TD(0) uses -step error (r + x 2 ) ; x Could use 2-step error (r + r x 3 ) ; x TD() weights k-step error proportional to k TD() As!, approaches supervised learning

31 So (I ; A T DEA) has radius < for small enough of expectation follows tricky stochastic argument Convergence show that random perturbations don't kill convergence [Dayan, to Convergence Expected update is A T D(EAw + R) Can write w := (I ; A T DEA)w old + k +error Sutton (988) proved that A T DEA is positive denite So 9 a norm in which (I ; A T DEA) is a contraction Jaakola, Jordan, Singh, Tsitsiklis, Van Roy]

32 TD(0) and min weighted SBE TD(0) does not nd minimum of weighted SBE, but close TD(0) solves A T D(EAw + R) =0 Min WSBE satises (DEA) T D(EAw + R) =0 solving TD eqs in batch mode is called LS-TD (for least Note: even though it's not minimizing squared anything) [Bradtke squares, & Barto]

33 If k parameters, might expect m = k Collocation Satisfy Bellman equations exactly at m states But Bellman equations are nonlinear, so in general might need more or fewer than k equations solution might not be unique hard to nd

34 Averagers For some function approximators [Gordon, Van Roy] m = k works solution is unique easy to nd Called averagers linear interpolation on mesh, multilinear interpolation Examples: grid, k-nearest-neighbor on

35 Pick a sample X Remember v(x) only for x 2 X Fitted value iteration nd v(x) for x 62 X use k-nearest-neighbor, linear interpolation, To... One step of tted VI has two parts: Do a step of exact VI Replace result with a cheap approximation

36 Approximate with v(x) = v(y) = w Approximators as mappings v(y) (,5) (3,3) v(x) MDP with 2 states, so value fn is 2D: (v(x) v(y))

37 7! M A 7! M A Approximators as mappings

38 7! M A 7! M A Exaggeration Two similar target value functions Larger dierence between tted functions Exaggeration = max-norm expansion

39 Goal Hill-car Forward Reverse Gravity tries to drive up hill, but engine is weak so must back up to Car momentum get

40 Results Approximate Exact

41 Results the hard way

42 renement (use ows, policy uncertainty, and value Adaptive to decide where to split) [Munos & Moore] uncertainty Interesting topics left uncovered Hidden state (similar problems to fn approx) Eligibility traces Policy iteration and -PI Actor-critic architectures Duality and RL Stopping problems

43 6 +7 Bonus extra slide: RL counterexample with values (0 0 0) Start one backup: (0 ) Do a line: (; ) Fit another backup: (0 + 7 Do 6 ) For 6 7, divergence

CS 287: Advanced Robotics Fall Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study

CS 287: Advanced Robotics Fall 2009 Lecture 14: Reinforcement Learning with Function Approximation and TD Gammon case study Pieter Abbeel UC Berkeley EECS Assignment #1 Roll-out: nice example paper: X.