Function Approximation in Reinforcement Learning
Geoff Gordon
ggordon@cs.cmu.edu
November 5, 1999
Overview
- Example (TD-Gammon)
- Admission
- Why approximate RL is hard
- TD(λ)
- Fitted value iteration (collocation)
- Example (k-nn for hill-car)
Backgammon
- Object: move all your pieces off first
- Opponent can block you or send you backwards
Backgammon cont'd
- 30 pieces (15 each for you and opponent)
- 24 points + bar + off = 26 locations
- 2 dice (36 possible rolls)
- Rough (over)estimate: 30^26 · 36 · 2 ≈ 1.8 × 10^40 states
- Branching factor: 36 × about 20
- Complications: gammon, backgammon, doubling cube
TD-Gammon [Tesauro]
- Neural network with one hidden layer
- 200 inputs: board position
- 40-80 hidden units
- 1 output: win probability
- Sigmoids on all hidden and output units (range [0, 1])
Using RL
- Win probability is value function if:
  - Reinforcement = 1 for a win, 0 for a loss, 0 while the game is not over
  - Discount is γ = 1
- So weights for network should approximately satisfy Bellman equation
- Warning: devil is in details! For now: ignore details, use TD(0)
Review of TD(0)
- Write v(x|w) for output of net with inputs x and weights w
- Given a transition (x, a, r, y)
- Update: w := w_old + α (r + γ v(y|w_old) − v(x|w_old)) ∇_w v(x|w)|_{w_old}
- Ignores a (assumes it was chosen greedily)
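As a concrete illustration, here is a minimal sketch of this update for a value function that is linear in its features, v(x|w) = w · φ(x), so that ∇_w v(x|w) is just φ(x). The feature map phi, step size alpha, and terminal-state handling are illustrative assumptions, not details from the talk.

```python
import numpy as np

def td0_update(w, phi, x, r, y, terminal, alpha=0.1, gamma=1.0):
    """One TD(0) update for a linear value function v(x|w) = w . phi(x).

    For a linear approximator the gradient of v(x|w) with respect to w
    is simply the feature vector phi(x)."""
    v_x = w @ phi(x)
    v_y = 0.0 if terminal else w @ phi(y)
    td_error = r + gamma * v_y - v_x          # r + gamma v(y|w_old) - v(x|w_old)
    return w + alpha * td_error * phi(x)      # step along grad_w v(x|w)
```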
TD(0) is not gradient descent
- Max isn't differentiable
- Haven't said how to choose transitions (they won't be independent)
- Step direction is not a full gradient
Bellman error
- Write Error(x, r, y | w) = (r + γ v(y|w) − v(x|w))²
- Write Error(x, r, y | w_1, w_2) = (r + γ v(y|w_2) − v(x|w_1))²
- TD uses ∇_w Error(x, r, y | w, w_old)|_{w_old}
- GD would use ∇_w Error(x, r, y | w)|_{w_old}
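To make the distinction concrete, here is a small sketch (for a linear v(x|w) = w · φ(x), with an illustrative feature map phi) of the two step directions: the TD step differentiates only through v(x|w) and treats v(y|w_old) as a constant, while true gradient descent on the squared Bellman error also differentiates through v(y|w).

```python
import numpy as np

def td_vs_gd_direction(w, phi, x, r, y, gamma=1.0):
    """Compare the TD(0) step direction with the (negative) gradient of the
    squared Bellman error, for a linear v(x|w) = w . phi(x)."""
    delta = r + gamma * w @ phi(y) - w @ phi(x)    # Bellman error
    td_step = delta * phi(x)                       # -1/2 grad_w Error(.|w, w_old)
    gd_step = delta * (phi(x) - gamma * phi(y))    # -1/2 grad_w Error(.|w)
    return td_step, gd_step
```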
Representation
- Raw board or raw + expert features
- Raw: Are there at least k of my pieces at location x? Are there at least k opposing pieces at location x? Is it my turn? My opponent's?
- For k > k_max, extra pieces added to last unit's activation
- k_max is 4 for points, 1 for bar and off
- All inputs scaled to lie approximately in [0, 1]
A useful trick
- Die rolls aren't represented
- Implicit in lookahead
- [diagram: position → (roll dice at random) → position + die roll → (choose move) → position → ...]
Training
- Start from zero knowledge
- Self play (up to 1.5 million games)
- Data: state, die roll, action, reinforcement, state, ...
- Compress to: state, state, ..., state, win/loss
Results
- Raw features give pretty good player (close to best computer player at the time)
- Raw + expert features give world-class player
- Absolute probabilities not so hot (10% error)
- Relative probabilities very good
What's easy about RL in backgammon
- Randomness: strong effective discount, kicks out of local minima
- Guaranteed to terminate
- Perfect model, no hidden state
- Self-play allows huge training set
What's hard about RL in backgammon
- Nonlinear function approximator: usual worry of local minima; complete divergence is also possible [Van Roy]
- Changing data distribution: changing policy causes change in importance of states; wrong distribution could cause divergence [Gordon, Baird]
- Game (minimax instead of min or max): self-play could get stuck (checkers, go) or oscillate forever
Huge variety of approximate RL algorithms
- Incremental vs. batch
- Model-based vs. direct
- Kind of function approximation
- Control vs. estimation
- State-values vs. action-values (V vs. Q) vs. no values (policy only)
One fundamental question
- What does "approximate solution to Bellman equations" mean?
- Subquestion: how do we find one?
Possible answers
- Bellman equations are a red herring
- Minimum squared Bellman error
- Minimum weighted squared Bellman error
- Relaxation of Bellman equations: satisfy exactly at some states, not at others
Minimum SBE example
- 3 × 100 grid of states
- Fit a line to each row: v(i, j) = a_j + b_j i
Minimum SBE example cont'd
[figure: three surface plots of the value function over the grid: Exact, Min SBE, Min weighted SBE]
Flows
- Fixed policy
- Fixed starting frequencies f_0(x)
- F(x, a) = expected # of times we visit state x and perform action a
- Measures how important (x, a) is to the value of the policy
Flows cont'd
- If γ < 1, use discounted flows
- Visit (x, a) at steps 2 and 7: discounted flow is γ² + γ⁷
- For all γ, F satisfies the conservation equation:
  Σ_a F(x, a) = f_0(x) + γ Σ_{y,a} F(y, a) p(x | y, a)
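The conservation equation is easy to check numerically. Below is a small sketch that computes discounted flows for a fixed policy on a made-up three-state chain (the chain, starting frequencies, and discount are illustrative, not from the talk); with one action per state the flows solve a linear system.

```python
import numpy as np

# Discounted flows of a fixed policy on a made-up 3-state chain.
gamma = 0.9
P = np.array([[0.0, 1.0, 0.0],      # P[x, y] = p(y | x, policy's action at x)
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])
f0 = np.array([1.0, 0.0, 0.0])      # starting frequencies

# Conservation with one action per state: F(x) = f0(x) + gamma * sum_y F(y) P[y, x]
F = np.linalg.solve(np.eye(3) - gamma * P.T, f0)

residual = F - (f0 + gamma * P.T @ F)
print(F, np.abs(residual).max())    # residual is ~0
```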
Aside
- Can find optimal flows using nonnegativity, conservation, and maximization of reinforcement
- F* maximizes total expected discounted reinforcement Σ_{x,a} E(r(x, a)) F(x, a)
- This is dual (in the sense of LP) to solving Bellman equations
- Optimal flows are the flows of an optimal policy (F* = F^{π*})
- Optimal flows determine optimal policy
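As a sketch of that dual LP (on a made-up two-state, two-action MDP; the model, rewards, and starting frequencies are illustrative), one can hand the nonnegativity and conservation constraints to an off-the-shelf LP solver and maximize expected discounted reinforcement:

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
nS, nA = 2, 2
P = np.zeros((nS, nA, nS))          # P[x, a, y] = p(y | x, a)
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.0, 1.0]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [1.0, 0.0]
r = np.array([[0.0, 1.0],           # expected reinforcement r(x, a)
              [2.0, 0.0]])
f0 = np.array([0.5, 0.5])           # starting frequencies

# Conservation: sum_a F(x,a) - gamma * sum_{y,a} F(y,a) p(x|y,a) = f0(x)
A_eq = np.zeros((nS, nS * nA))
for x in range(nS):
    for y in range(nS):
        for a in range(nA):
            A_eq[x, y * nA + a] = (x == y) - gamma * P[y, a, x]

# Maximize sum_{x,a} r(x,a) F(x,a) subject to conservation and F >= 0.
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=f0,
              bounds=[(0, None)] * (nS * nA))
F = res.x.reshape(nS, nA)
print(F, F.argmax(axis=1))          # flows, and the policy they determine
```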
Choosing weights
- Optimal flows would be good weights
- If just one policy, can find flows
- Otherwise, chicken and egg problem
- Iterative improvement? Yes, but none known to converge
- Research question: EM algorithm?
Matrix form of TD(0)
- Assume one policy, n states
- Write P for transition probability matrix (n × n): p_ij is probability of going to state j from state i; P_i (ith row of P) is distribution of next state given that current state is i
- Write r for reinforcement vector (n × 1): r_i is expected reinforcement at state i
Matrix form of TD(0) cont'd
- Write v for value function (n × 1): v_i is value of state i
- (γ P v + r) − v is vector of Bellman errors
- Bellman equation is (γ P − I) v + r = 0
- Write E = γ P − I
Matrix form of TD(0) cont'd
- Assume v is linear in parameters w (k × 1)
- Assume w = 0 corresponds to v(x) = 0 for all x
- ∇_w v_i is a constant k-vector (call it A_i)
- ∇_w v is a constant n × k matrix (call it A)
Expected update
E[(r + γ v(y|w) − v(x|w)) ∇_w v(x|w)]
  = Σ_x P(x) E[(r + γ v(y|w) − v(x|w)) A_x | x]
  = Σ_x P(x) A_x (E(r|x) + γ E(v(y|w) | x) − v(x|w))
  = Σ_x P(x) A_x (r_x + γ E(A_y w | x) − A_x w)
  = Σ_x P(x) A_x (r_x + γ (Σ_y P(y|x) A_y) w − A_x w)
Expected update cont'd
Σ_x P(x) A_x (r_x + γ (Σ_y P(y|x) A_y) w − A_x w)
  = Σ_x P(x) A_x (r_x + γ (P_x^T A) w − A_x w)
  = Σ_x P(x) A_x (r_x + (γ P_x^T A − A_x) w)
  = Σ_x P(x) A_x (r_x + (γ P_x − U_x)^T A w)
  = (DA)^T (R + (γ P − I) A w)
(Here U_x is the unit vector for state x, D = diag(P(x)), and R = r, the reinforcement vector.)
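This closed form is easy to sanity-check numerically. The sketch below builds a small made-up chain (transition matrix, rewards, features, weights, and state distribution are all illustrative) and compares the direct expectation against (DA)^T (R + (γP − I)Aw):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n, k = 0.9, 3, 2
P = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])          # transition matrix
R = np.array([1.0, 0.0, 2.0])            # expected reinforcements r_x
A = rng.normal(size=(n, k))              # rows are the gradients A_x
w = rng.normal(size=k)
d = np.array([0.5, 0.3, 0.2])            # state distribution P(x)
D = np.diag(d)

# Closed form: (DA)^T (R + (gamma P - I) A w)
closed = (D @ A).T @ (R + (gamma * P - np.eye(n)) @ A @ w)

# Direct expectation: sum over x, y of P(x) P(y|x) * TD error * A_x
direct = sum(d[x] * P[x, y] * (R[x] + gamma * A[y] @ w - A[x] @ w) * A[x]
             for x in range(n) for y in range(n))

print(np.allclose(closed, direct))       # True
```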
TD(λ)
- Data: x_1, r_1, x_2, r_2, ...
- TD(0) uses 1-step error: (r_1 + γ v(x_2)) − v(x_1)
- Could use 2-step error: (r_1 + γ r_2 + γ² v(x_3)) − v(x_1)
- TD(λ) weights the k-step error proportional to λ^k
- As λ → 1, approaches supervised learning
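One common way to implement this weighting is the backward view with eligibility traces. Here is a minimal sketch for a linear v(x|w) = w · φ(x); the episode format, feature map, and step sizes are illustrative assumptions, not the talk's notation.

```python
import numpy as np

def td_lambda_episode(w, phi, episode, alpha=0.1, gamma=1.0, lam=0.7):
    """TD(lambda) with accumulating eligibility traces for a linear
    v(x|w) = w . phi(x).  `episode` is [(x1, r1), ..., (xT, rT)], where
    r_t is the reinforcement on leaving x_t and x_T is the last state
    before termination."""
    z = np.zeros_like(w)                      # eligibility trace
    for t in range(len(episode)):
        x, r = episode[t]
        v_x = w @ phi(x)
        v_next = w @ phi(episode[t + 1][0]) if t + 1 < len(episode) else 0.0
        delta = r + gamma * v_next - v_x      # 1-step TD error
        z = gamma * lam * z + phi(x)          # decay and bump the trace
        w = w + alpha * delta * z             # credit earlier states via z
    return w
```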
Convergence
- Expected update is A^T D (E A w + R)
- Can write w := (I + α A^T D E A) w_old + k + error
- Sutton (1988) proved that −A^T D E A (= A^T D (I − γP) A) is positive definite
- So (I + α A^T D E A) has spectral radius < 1 for small enough α
- So ∃ a norm in which (I + α A^T D E A) is a contraction
- Convergence of expectation follows
- Showing that random perturbations don't kill convergence takes a tricky stochastic argument [Dayan; Jaakkola, Jordan, Singh; Tsitsiklis, Van Roy]
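A small numerical illustration of this argument (the random chain and features below are made up; D is taken to be the chain's stationary distribution, the on-policy case the convergence results cover): the eigenvalues of A^T D (I − γP) A have positive real part, and for a small enough step size the expected-update matrix has spectral radius below 1.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n, k = 0.9, 5, 3
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # random chain
A = rng.normal(size=(n, k))                                  # feature matrix

# Stationary distribution d of P (left eigenvector for eigenvalue 1).
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d = d / d.sum()
D = np.diag(d)

S = A.T @ D @ (np.eye(n) - gamma * P) @ A      # = -A^T D E A
print(np.linalg.eigvals(S).real.min())         # > 0: "positive definite"

# For small enough alpha, I + alpha A^T D E A = I - alpha S contracts.
alpha = 0.9 * min(2 * l.real / abs(l) ** 2 for l in np.linalg.eigvals(S))
print(np.abs(np.linalg.eigvals(np.eye(k) - alpha * S)).max())   # < 1
```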
TD(0) and min weighted SBE
- TD(0) does not find the minimum of weighted SBE, but close
- TD(0) solves A^T D (E A w + R) = 0
- Min WSBE satisfies (E A)^T D (E A w + R) = 0
- Note: solving the TD equations in batch mode is called LS-TD (for least squares, even though it's not minimizing the square of anything) [Bradtke & Barto]
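Here is a small sketch of the sample-based version of that batch solve (LS-TD for a linear v(x|w) = w · φ(x)); the transition format, feature map, and the tiny ridge term added for invertibility are illustrative assumptions:

```python
import numpy as np

def lstd0(transitions, phi, k, gamma=0.9):
    """Batch solution of the TD(0) fixed-point equations (LS-TD) from
    sampled transitions (x, r, y, done), for v(x|w) = w . phi(x).
    Solves  sum_t phi(x)(phi(x) - gamma phi(y))^T w = sum_t phi(x) r."""
    M = 1e-6 * np.eye(k)                      # tiny ridge term for invertibility
    b = np.zeros(k)
    for (x, r, y, done) in transitions:
        fx = phi(x)
        fy = np.zeros(k) if done else phi(y)
        M += np.outer(fx, fx - gamma * fy)
        b += fx * r
    return np.linalg.solve(M, b)
```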
Collocation
- Satisfy Bellman equations exactly at m states
- If k parameters, might expect m = k
- But Bellman equations are nonlinear, so in general: might need more or fewer than k equations; solution might not be unique; hard to find
Averagers
- For some function approximators [Gordon, Van Roy]: m = k works; solution is unique; easy to find
- Called averagers
- Examples: k-nearest-neighbor, linear interpolation on a mesh, multilinear interpolation on a grid
Fitted value iteration
- Pick a sample X
- Remember v(x) only for x ∈ X
- To find v(x) for x ∉ X, use k-nearest-neighbor, linear interpolation, ...
- One step of fitted VI has two parts: do a step of exact VI, then replace the result with a cheap approximation
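A minimal sketch of that loop, assuming a deterministic model at the sample states and a k-nearest-neighbor averager as the cheap approximation (the data layout, the knn_fit helper, and the parameters are illustrative, not from the talk):

```python
import numpy as np

def knn_fit(X, targets, k=1):
    """Return a k-nearest-neighbor value function fitted to (X, targets).
    An averager: every prediction is a convex combination of the targets."""
    X = np.asarray(X, dtype=float)
    def v(x):
        d = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
        return targets[np.argsort(d)[:k]].mean()
    return v

def fitted_value_iteration(samples, next_states, rewards, gamma=0.95, sweeps=100):
    """samples[i] is a sampled state; next_states[i][a] and rewards[i][a]
    give a deterministic model at that state.  Each sweep does an exact
    Bellman backup at the samples, then refits the cheap approximation."""
    v = lambda x: 0.0
    for _ in range(sweeps):
        targets = np.array([max(r + gamma * v(y)
                                for r, y in zip(rewards[i], next_states[i]))
                            for i in range(len(samples))])
        v = knn_fit(samples, targets)
    return v
```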
Approximators as mappings
- MDP with 2 states, so value fn is 2D: (v(x), v(y))
- Approximate with v(x) = v(y) = w
- [figure: the (v(x), v(y)) plane; the target point (1, 5) is fitted to (3, 3) on the line v(x) = v(y)]
Approximators as mappings cont'd
[figure: example target value functions and their images under the fitting map, v ↦ M_A(v)]
Exaggeration
- Two similar target value functions
- Larger difference between fitted functions
- Exaggeration = max-norm expansion
- [figure: the two targets and their images under v ↦ M_A(v)]
Hill-car
- Car tries to drive up hill, but engine is weak, so it must back up to get momentum
- [figure: hill with Goal at top; arrows on car: Forward, Reverse, Gravity]
Results
[figure: hill-car value functions, Approximate vs. Exact]
Results the hard way
[figure: hill-car value function]
Interesting topics left uncovered
- Adaptive refinement (use flows, policy uncertainty, and value uncertainty to decide where to split) [Munos & Moore]
- Hidden state (similar problems to fn approx)
- Eligibility traces
- Policy iteration and λ-PI
- Actor-critic architectures
- Duality and RL
- Stopping problems
Bonus extra slide: RL counterexample
- Start with values (0, 0, 0)
- Do one backup: (0, 1, 1)
- Fit a line: (1/6, 2/3, 7/6)
- Do another backup: (0, 1 + 7γ/6, 1 + 7γ/6)
- For γ ≥ 6/7, divergence
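A quick numerical sketch of this example as read above (the assumed model: states 1, 2, 3; state 1 is terminal with value 0; states 2 and 3 both get reinforcement 1 and move to state 3; each iteration does a backup and then refits a least-squares line):

```python
import numpy as np

def run(gamma, iters=20):
    """Alternate Bellman backups with least-squares line fits, under the
    assumed 3-state model described above."""
    xs = np.array([1.0, 2.0, 3.0])
    v = np.zeros(3)                          # start with values (0, 0, 0)
    for _ in range(iters):
        backup = np.array([0.0, 1.0 + gamma * v[2], 1.0 + gamma * v[2]])
        slope, intercept = np.polyfit(xs, backup, 1)   # fit a line v(i) = a + b i
        v = intercept + slope * xs
    return v

print(run(0.80))   # gamma < 6/7: values stay bounded
print(run(0.95))   # gamma > 6/7: values grow without bound
```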