Reinforcement learning



Based on [Kaelbling et al., 1996; Bertsekas, 2000]

Bert Kappen

Reinforcement learning

"Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment" [Kaelbling et al., 1996].

Approaches:
- search the space of behaviors: genetic algorithms
- dynamic programming

The action $a$ changes the world, $i$ is an indication of the current state $s$, and $r$ is the reinforcement signal. Behaviour should be such as to increase the long-run sum of values of $r$.

Formally:
- a discrete set of environment states $S$
- a discrete set of agent actions $A$
- a set of scalar reinforcement signals, $\{0, 1\}$ or the real numbers

Find the optimal policy $\pi : S \to A$.

The environment is
- non-deterministic: taking the same action in the same state may yield a different next state
- stationary: $p(s' \mid s, a)$ is independent of time.

Models of optimality

The finite horizon model:

$$E\left[\sum_{t=0}^{h} r_t\right]$$

It does not consider at $t = 0$ what happens after $t = h$. Two uses:
- Fixed horizon: take the $h$-step optimal action, then the $(h-1)$-step optimal action, ..., then the 1-step optimal action
- Receding horizon: always take the $h$-step optimal action

Models of optimality

The infinite horizon discounted model:

$$E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad 0 \le \gamma < 1$$

$\gamma$ can be read as the probability to live another step, or as a mathematical trick to bound the infinite sum.

Models of optimality

The average reward model:

$$\lim_{h \to \infty} E\left[\frac{1}{h}\sum_{t=0}^{h} r_t\right]$$

It is the limit of the discounted model for $\gamma \to 1$.

The problem with this model is that there is no way to distinguish between two policies, one of which gains a large amount of reward in the initial phases and the other of which does not.

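Models of optimality

Only a single decision, with three choices, is made in the start state (upper left circle, $t = 0$). Different criteria yield different optimal solutions.

The finite horizon model with $h = 5$ prefers the first choice (rewards collected at $t = 0, \ldots, 4$):

$$\sum_{t=0}^{4} r_t = (6, 0, 0)$$

The discounted reward model with $\gamma = 0.9$ prefers the second choice:

$$\sum_{t=0}^{\infty} \gamma^t r_t = \left(2\sum_{t=2}^{\infty}\gamma^t,\; 10\sum_{t=5}^{\infty}\gamma^t,\; 11\sum_{t=6}^{\infty}\gamma^t\right) = \left(2\gamma^2,\; 10\gamma^5,\; 11\gamma^6\right)\frac{1}{1-\gamma} = (16.2,\; 59.0,\; 58.5)$$

The average reward model prefers the third choice:

$$\frac{1}{h}\sum_{t=0}^{h} r_t = \frac{1}{h}\left(\sum_{t=2}^{h} 2,\; \sum_{t=5}^{h} 10,\; \sum_{t=6}^{h} 11\right) = \left(\frac{2(h-1)}{h},\; \frac{10(h-4)}{h},\; \frac{11(h-5)}{h}\right) \to (2, 10, 11) \quad (h \to \infty)$$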

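Where we used $\sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma}$.

Proof: Define $S = \sum_{t=0}^{T} \gamma^t$. Then

$$(1-\gamma)S = \sum_{t=0}^{T}\gamma^t - \sum_{t=0}^{T}\gamma^{t+1} = \sum_{t=0}^{T}\gamma^t - \sum_{t=1}^{T+1}\gamma^t = 1 - \gamma^{T+1}$$

$$\sum_{t=0}^{\infty}\gamma^t = \lim_{T\to\infty} S = \lim_{T\to\infty}\frac{1-\gamma^{T+1}}{1-\gamma} = \frac{1}{1-\gamma}$$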

Models of optimality

$h$ and $\gamma$ model the horizon time. The optimal policy depends strongly on the horizon time.
- When $h$ is small, the finite horizon cost prefers immediate rewards.
- When $h \to \infty$, the finite horizon cost becomes like the average cost.
- Discounted reward with small $\gamma$ prefers immediate rewards.

For $\gamma = 0.9$:

$$\sum_{t=0}^{\infty} \gamma^t r_t = \left(2\sum_{t=2}^{\infty}\gamma^t,\; 10\sum_{t=5}^{\infty}\gamma^t,\; 11\sum_{t=6}^{\infty}\gamma^t\right) = \left(2\gamma^2,\; 10\gamma^5,\; 11\gamma^6\right)\frac{1}{1-\gamma} = (16.2,\; 59.0,\; 58.5)$$

For $\gamma = 0.2$:

$$\sum_{t=0}^{\infty} \gamma^t r_t = \left(2\gamma^2,\; 10\gamma^5,\; 11\gamma^6\right)\frac{1}{1-\gamma} = (0.1,\; 0.004,\; 0.00088)$$
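The dependence on $\gamma$ can be checked numerically. The following is a minimal sketch, assuming each choice pays a constant reward from some start time onwards (reward 2 from $t = 2$, 10 from $t = 5$, 11 from $t = 6$); the function and variable names are illustrative and not part of the slides.

```python
def discounted_return(reward, start, gamma):
    """Discounted value of receiving `reward` at every step t >= start."""
    # sum_{t=start}^inf gamma^t * reward = reward * gamma^start / (1 - gamma)
    return reward * gamma**start / (1.0 - gamma)

# (reward per step, first time step at which it is received) for the three choices
CHOICES = [(2, 2), (10, 5), (11, 6)]

def compare(gamma):
    """Return the discounted value of each of the three choices."""
    return [discounted_return(r, t0, gamma) for r, t0 in CHOICES]

def best_choice(gamma):
    """Index (0, 1 or 2) of the choice with the largest discounted value."""
    values = compare(gamma)
    return max(range(len(values)), key=lambda i: values[i])

if __name__ == "__main__":
    for g in (0.9, 0.2):
        print(g, [round(v, 5) for v in compare(g)], "best:", best_choice(g) + 1)
```

With $\gamma = 0.9$ the second choice wins; with $\gamma = 0.2$ the first choice wins, because late rewards are discounted heavily.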

Discrete time control

Consider the control of a discrete time deterministic dynamical system:

$$x_{t+1} = x_t + f(t, x_t, u_t), \qquad t = 0, 1, \ldots, T-1$$

$x_t$ describes the state and $u_t$ specifies the control or action at time $t$. Given $x_{t=0} = x_0$ and $u_{0:T-1} = u_0, u_1, \ldots, u_{T-1}$, we can compute $x_{1:T}$.

Define a cost for each sequence of controls:

$$C(x_0, u_{0:T-1}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t)$$

The problem of optimal control is to find the sequence $u_{0:T-1}$ that minimizes $C(x_0, u_{0:T-1})$.

Dynamic programming

Find the minimal cost path from A to J.

$$C(J) = 0, \quad C(H) = 3, \quad C(I) = 4$$

$$C(F) = \min(6 + C(H),\; 3 + C(I)) = \min(9, 7) = 7$$

Discrete time control

The optimal control problem can be solved by dynamic programming. Introduce the optimal cost-to-go:

$$J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)$$

which solves the optimal control problem from an intermediate time $t$ until the fixed end time $T$, for all intermediate states $x_t$. Then,

$$J(T, x) = \phi(x)$$

$$J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})$$

Discrete time control

One can recursively compute $J(t, x)$ from $J(t+1, x)$ for all $x$ in the following way:

$$J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)$$

$$= \min_{u_t} \left( R(t, x_t, u_t) + \min_{u_{t+1:T-1}} \left[ \phi(x_T) + \sum_{s=t+1}^{T-1} R(s, x_s, u_s) \right] \right)$$

$$= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_{t+1}) \right)$$

$$= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right)$$

This is called the Bellman Equation. It computes $u$ as a function of $x, t$ for all intermediate $t$ and all $x$.

Discrete time control

The algorithm to compute the optimal control $u^*_{0:T-1}$, the optimal trajectory $x^*_{1:T}$ and the optimal cost is given by

1. Initialization: $J(T, x) = \phi(x)$
2. Backwards: for $t = T-1, \ldots, 0$ and for all $x$ compute
   $$u^*_t(x) = \arg\min_u \left\{ R(t, x, u) + J(t+1, x + f(t, x, u)) \right\}$$
   $$J(t, x) = R(t, x, u^*_t) + J(t+1, x + f(t, x, u^*_t))$$
3. Forwards: for $t = 0, \ldots, T-1$ compute
   $$x^*_{t+1} = x^*_t + f(t, x^*_t, u^*_t(x^*_t))$$

NB: the backward computation requires $u^*_t(x)$ for all $x$.
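The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the state lives on a finite integer grid, controls that would leave the grid are skipped, and the toy problem at the bottom ($f = u$, $R = u^2$, $\phi(x) = 2x^2$) is hypothetical and not taken from the slides.

```python
def solve(T, states, controls, f, R, phi):
    """Backward dynamic programming for x_{t+1} = x_t + f(t, x_t, u_t)."""
    J = [dict() for _ in range(T + 1)]   # J[t][x]: optimal cost-to-go
    policy = [dict() for _ in range(T)]  # policy[t][x]: optimal control u_t(x)
    for x in states:                     # 1. initialization
        J[T][x] = phi(x)
    for t in range(T - 1, -1, -1):       # 2. backwards, for all x
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls:
                x_next = x + f(t, x, u)
                if x_next not in J[t + 1]:
                    continue  # assumption: controls leaving the grid are not allowed
                cost = R(t, x, u) + J[t + 1][x_next]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            policy[t][x] = best_u
            J[t][x] = best_cost
    return J, policy

def rollout(x0, T, f, policy):
    """3. forwards: follow the optimal controls from x0."""
    xs = [x0]
    for t in range(T):
        x = xs[-1]
        xs.append(x + f(t, x, policy[t][x]))
    return xs

# Hypothetical toy problem: move on the integers, pay u^2 per step and 2 x_T^2 at the end.
T = 3
states = range(-5, 6)
controls = (-1, 0, 1)
f = lambda t, x, u: u
R = lambda t, x, u: u * u
phi = lambda x: 2 * x * x

J, policy = solve(T, states, controls, f, R, phi)
trajectory = rollout(2, T, f, policy)
```

Starting from $x_0 = 2$, the cheapest plan takes two steps towards the origin and then stays put, so the optimal cost is 2.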

Stochastic case

$$x_{t+1} = x_t + f(t, x_t, u_t, w_t), \qquad t = 0, \ldots, T-1$$

At time $t$, $w_t$ is a random value drawn from a probability distribution $p(w)$. For instance,

$$x_{t+1} = x_t + w_t, \qquad x_0 = 0$$

$$w_t = \pm 1, \qquad p(w_t = 1) = p(w_t = -1) = 1/2$$

$$x_t = \sum_{s=0}^{t-1} w_s$$

Thus, $x_t$ is a random variable.

Stochastic case

$$C(x_0) = E\left[\phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t, \xi_t)\right] = \sum_{w_{0:T-1}} \sum_{\xi_{0:T-1}} p(w_{0:T-1})\, p(\xi_{0:T-1}) \left( \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t, \xi_t) \right)$$

with $\xi_t, x_t, w_t$ random.

Closed loop control: find functions $u_t(x_t)$ that minimize the remaining expected cost when in state $x$ at time $t$. $\pi = \{u_0(\cdot), \ldots, u_{T-1}(\cdot)\}$ is called a policy.

$$x_{t+1} = x_t + f(t, x_t, u_t(x_t), w_t)$$

$$C_\pi(x_0) = E\left[\phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t(x_t), \xi_t)\right]$$

$\pi^* = \arg\min_\pi C_\pi(x_0)$ is the optimal policy.

Stochastic Bellman Equation

$$J(t, x_t) = \min_{u_t} E\left[ R(t, x_t, u_t, \xi_t) + J(t+1, x_t + f(t, x_t, u_t, w_t)) \right]$$

$$J(T, x) = \phi(x)$$

$u_t$ is optimized for each $x_t$ separately. $\pi = \{u_0, \ldots, u_{T-1}\}$ is an optimal policy.

Inventory problem

- $x_t \in \{0, 1, 2\}$: stock available at the beginning of period $t$.
- $u_t$: stock ordered at the beginning of period $t$. Maximum storage is 2: $u_t \le 2 - x_t$.
- $w_t \in \{0, 1, 2\}$: demand during period $t$, with $p(w = 0, 1, 2) = (0.1, 0.7, 0.2)$; excess demand is lost.
- $u_t$ is the cost of purchasing $u_t$ units; $(x_t + u_t - w_t)^2$ is the cost of stock at the end of period $t$.
- Planning horizon $T = 3$.

$$x_{t+1} = \max(0, x_t + u_t - w_t)$$

$$C(x_0, u_{0:T-1}) = E\left[\sum_{t=0}^{2} \left( u_t + (x_t + u_t - w_t)^2 \right)\right]$$


Apply Bellman Equation

Start with $J_3(x_3) = 0$ for all $x_3$.

$$J_t(x_t) = \min_{u_t} E\left[ R(x_t, u_t, w_t) + J_{t+1}(f(x_t, u_t, w_t)) \right]$$

$$R(x, u, w) = u + (x + u - w)^2$$

$$f(x, u, w) = \max(0, x + u - w)$$

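Dynamic programming in action

Assume we are at stage $t = 2$ and the stock is $x_2$. The cost-to-go consists of the cost of the order $u_2$ and the cost of the stock left at the end of period $t = 2$.

$$J_2(x_2) = \min_{0 \le u_2 \le 2 - x_2} E\left[ u_2 + (x_2 + u_2 - w_2)^2 \right]$$

$$= \min_{0 \le u_2 \le 2 - x_2} \left( u_2 + 0.1\,(x_2 + u_2)^2 + 0.7\,(x_2 + u_2 - 1)^2 + 0.2\,(x_2 + u_2 - 2)^2 \right)$$

$$J_2(0) = \min_{0 \le u_2 \le 2} \left( u_2 + 0.1\,u_2^2 + 0.7\,(u_2 - 1)^2 + 0.2\,(u_2 - 2)^2 \right)$$

- $u_2 = 0$: rhs $= 0 + 0.1 \cdot 0 + 0.7 \cdot 1 + 0.2 \cdot 4 = 1.5$
- $u_2 = 1$: rhs $= 1 + 0.1 \cdot 1 + 0.7 \cdot 0 + 0.2 \cdot 1 = 1.3$
- $u_2 = 2$: rhs $= 2 + 0.1 \cdot 4 + 0.7 \cdot 1 + 0.2 \cdot 0 = 3.1$

Thus, $u_2(x_2 = 0) = 1$ and $J_2(x_2 = 0) = 1.3$.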

Inventory problem

The computation can be repeated for $x_2 = 1$ and $x_2 = 2$, completing stage 2, and subsequently for stage 1 and stage 0.
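The full backward recursion for the inventory problem is small enough to run directly. The sketch below uses the demand distribution, cost and dynamics stated in the slides; the function names and data layout are illustrative choices.

```python
P_W = {0: 0.1, 1: 0.7, 2: 0.2}  # demand distribution p(w)
T, MAX_STOCK = 3, 2             # planning horizon and maximum storage

def stage_cost(x, u, w):
    """R(x, u, w) = u + (x + u - w)^2: purchase cost plus end-of-period stock cost."""
    return u + (x + u - w) ** 2

def next_stock(x, u, w):
    """f(x, u, w) = max(0, x + u - w): excess demand is lost."""
    return max(0, x + u - w)

def solve_inventory():
    """Return J[t][x] (expected cost-to-go) and policy[t][x] (optimal order)."""
    J = [[0.0] * (MAX_STOCK + 1) for _ in range(T + 1)]  # J[T][x] = 0
    policy = [[0] * (MAX_STOCK + 1) for _ in range(T)]
    for t in range(T - 1, -1, -1):
        for x in range(MAX_STOCK + 1):
            best = None
            for u in range(MAX_STOCK - x + 1):  # storage constraint u <= 2 - x
                q = sum(p * (stage_cost(x, u, w) + J[t + 1][next_stock(x, u, w)])
                        for w, p in P_W.items())
                if best is None or q < best[0]:
                    best = (q, u)
            J[t][x], policy[t][x] = best
    return J, policy

if __name__ == "__main__":
    J, policy = solve_inventory()
    for t in range(T):
        print("stage", t, [round(v, 3) for v in J[t]], policy[t])
```

Stage 2 reproduces the hand computation: with empty stock it is optimal to order one unit, at expected cost 1.3.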

Exploitation versus Exploration: The Single-State Case

The k-armed bandit problem: the agent is in a room with a collection of $k$ gambling machines (each called a "one-armed bandit"). The agent is permitted a fixed number of pulls, $h$. Any arm may be pulled on each turn. The machines do not require a deposit to play; the only cost is in wasting a pull playing a suboptimal machine. When arm $i$ is pulled, machine $i$ pays off 1 or 0, with unknown probability $p_i$.

What should the agent's strategy be? There is a trade-off between
- exploration: try many new arms
- exploitation: stick with a good arm

The bandit problem is an RL problem with a single state.

Bayesian model

We assume prior distributions over the parameters $p_i$. Consider the Beta distribution over $0 \le x \le 1$, parametrized by integers $\alpha, \beta > 0$:

$$P(x \mid \alpha, \beta) = \frac{(\alpha + \beta - 1)!}{(\alpha - 1)!\,(\beta - 1)!}\, x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad E[x] = \frac{\alpha}{\alpha + \beta}$$

$P(p_i \mid \alpha = \beta = 1)$ can be used as a flat prior over $p_i$ to model ignorance of the value of $p_i$.

When pulling arm $i$ $n_i$ times, giving $w_i$ times a payoff 1, we can compute the posterior distribution over $p_i$ as

$$P(p_i \mid n_i, w_i) \propto (\text{likelihood to observe } w_i \text{ in } n_i \text{ trials given } p_i) \times \text{prior}$$

$$\propto p_i^{w_i} (1 - p_i)^{n_i - w_i} \propto P(p_i \mid \alpha = w_i + 1, \beta = n_i - w_i + 1)$$

$$E[p_i] = \frac{w_i + 1}{n_i + 2}$$

NB: if you pull once and win, $n_i = w_i = 1$, the expected return is $2/3$.

Dynamic programming solution

Although the agent has only one state, the knowledge (or belief) of the agent changes while playing. This is the notion of belief state.

If arm $i$ has been pulled $n_i$ times, yielding a positive payoff $w_i$ times, the belief state is $\{n_1, w_1, \ldots, n_k, w_k\}$ with $k$ the number of bandits.

Suppose we can pull one of the arms $h$ times in total. Define $t = \sum_i n_i$ as the current iteration, $0 \le t \le h$. At each $t$ we wish to pull the best arm based on our experience so far.

We write $V^*(n_1, w_1, \ldots, n_k, w_k)$ for the expected remaining payoff at time $t = \sum_i n_i$, given that a total of $h$ pulls are available and that we use the remaining pulls optimally.

Dynamic programming solution

If $\sum_i n_i = h$ there are no remaining pulls and $V^*(n_1, w_1, \ldots, n_k, w_k) = 0$.

If we know $V^*$ for all belief states at time $t$, we can compute $V^*$ for any belief state at time $t - 1$:

$$V^*(n_1, w_1, \ldots, n_k, w_k) = \max_i E\left[\text{payoff when the agent takes action } i \text{ and then acts optimally for the remaining pulls}\right]$$

$$= \max_i \left[ \rho_i \cdot (\text{arm } i \text{ returns } 1) + (1 - \rho_i) \cdot (\text{arm } i \text{ returns } 0) \right]$$

$$= \max_i \left[ \rho_i + \rho_i V^*(n_1, w_1, \ldots, n_i + 1, w_i + 1, \ldots, n_k, w_k) + (1 - \rho_i) V^*(n_1, w_1, \ldots, n_i + 1, w_i, \ldots, n_k, w_k) \right]$$

$$\rho_i = E[p_i] = \frac{w_i + 1}{n_i + 2}$$

NB: Error in Kaelbling formula.

The computation is linear in the number of belief states times actions, and thus exponential in the horizon.

Example

$h = 4$, two bandits. Notation: we abbreviate $V^*(n_1, w_1, n_2, w_2)$ as $(n_1 w_1 n_2 w_2)$, with

$$\rho_1 = \frac{w_1 + 1}{n_1 + 2}, \qquad \rho_2 = \frac{w_2 + 1}{n_2 + 2}$$

Use the Bellman equation to compute all values backwards:

If $t = n_1 + n_2 = 4$, then $V^*(n_1, w_1, n_2, w_2) = 0$.

Consider states with $t = n_1 + n_2 = 3$. For instance,

$$(0030) = \max(\rho_1, \rho_2) = \max\left(\tfrac{1}{2}, \tfrac{1}{5}\right) = \tfrac{1}{2}$$

$$(2211) = \max(\rho_1, \rho_2) = \max\left(\tfrac{3}{4}, \tfrac{2}{3}\right) = \tfrac{3}{4} = (1122)$$

$$(2111) = \max(\rho_1, \rho_2) = \max\left(\tfrac{2}{4}, \tfrac{2}{3}\right) = \tfrac{2}{3} = (1121)$$

...

Consider states with $t = n_1 + n_2 = 2$. For instance,

$$(1111) = \max\left[ \rho_1 + \rho_1 (2211) + (1 - \rho_1)(2111),\; \rho_2 + \rho_2 (1122) + (1 - \rho_2)(1121) \right] = \tfrac{2}{3} + \tfrac{2}{3} \cdot \tfrac{3}{4} + \tfrac{1}{3} \cdot \tfrac{2}{3} = 1.39$$

Matlab results:

t = 3:
(0030)=0.50 (0031)=0.50 (0032)=0.60 (0033)=0.80
(1020)=0.33 (1021)=0.50 (1022)=0.75
(1120)=0.67 (1121)=0.67 (1122)=0.75
(2010)=0.33 (2011)=0.67
(2110)=0.50 (2111)=0.67
(2210)=0.75 (2211)=0.75
(3000)=0.50 (3100)=0.50 (3200)=0.60 (3300)=0.80

t = 2:
(0020)=1.00 (0021)=1.08 (0022)=1.50
(1010)=0.72 (1011)=1.33 (1110)=1.33 (1111)=1.39
(2000)=1.00 (2100)=1.08 (2200)=1.50

t = 1:
(0010)=1.53 (0011)=2.03 (1000)=1.53 (1100)=2.03

t = 0:
(0000)=2.28
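The table can be reproduced with a short memoized recursion over belief states. This is a sketch in Python rather than the Matlab code behind the slides; it assumes the flat Beta prior, so that $\rho_i = (w_i + 1)/(n_i + 2)$, and the names `H`, `rho` and `V` are illustrative.

```python
from functools import lru_cache

H = 4  # total number of pulls available

def rho(n, w):
    """Posterior mean payoff of an arm pulled n times with w wins (flat prior)."""
    return (w + 1) / (n + 2)

@lru_cache(maxsize=None)
def V(state):
    """Expected remaining payoff under optimal play.

    state is a tuple (n1, w1, n2, w2, ...) of pull and win counts per arm.
    """
    if sum(state[0::2]) == H:  # no pulls left
        return 0.0
    best = 0.0
    for i in range(len(state) // 2):
        n, w = state[2 * i], state[2 * i + 1]
        r = rho(n, w)
        win = state[:2 * i] + (n + 1, w + 1) + state[2 * i + 2:]
        lose = state[:2 * i] + (n + 1, w) + state[2 * i + 2:]
        best = max(best, r + r * V(win) + (1 - r) * V(lose))
    return best

if __name__ == "__main__":
    for s in [(0, 0, 3, 0), (1, 1, 1, 1), (1, 1, 0, 0), (0, 0, 0, 0)]:
        print(s, round(V(s), 2))
```

The recursion visits each belief state once, which is why the cost grows with the number of belief states and hence exponentially with the horizon.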

Example

$$V^*(n_1, w_1, n_2, w_2) = \max\big( \rho_1 + \rho_1 V^*(n_1 + 1, w_1 + 1, n_2, w_2) + (1 - \rho_1) V^*(n_1 + 1, w_1, n_2, w_2),$$
$$\qquad \rho_2 + \rho_2 V^*(n_1, w_1, n_2 + 1, w_2 + 1) + (1 - \rho_2) V^*(n_1, w_1, n_2 + 1, w_2) \big)$$

$$\rho_i = \frac{w_i + 1}{n_i + 2}$$

Use the values to compute the optimal strategy forwards:

First step: pull arm 1 and win. Then $\rho_1 = 2/3$, $\rho_2 = 1/2$.

Second step: the optimal second pull from state (1100) is

$$\arg\max\left( \rho_1 + \rho_1 (2200) + (1 - \rho_1)(2100),\; \rho_2 + \rho_2 (1111) + (1 - \rho_2)(1110) \right)$$

$$= \arg\max\left( \tfrac{2}{3} + \tfrac{2}{3} \cdot 1.50 + \tfrac{1}{3} \cdot 1.08,\; \tfrac{1}{2} + \tfrac{1}{2} \cdot 1.39 + \tfrac{1}{2} \cdot 1.33 \right) = \arg\max(2.03,\; 1.86)$$

so arm 1 is pulled again.

...

Ad-hoc strategies

Strategies that do not use the Bellman equation.

Optimism in the face of uncertainty:
- put a strong optimistic prior belief $P(p_i \mid n_i, w_i)$
- for instance, use $w_i = n_i$ with $n_i$ a number of phantom plays: $P(p_i) \propto p_i^{n_i}$

Randomized strategies:

$$P(a = i) = \frac{\exp(\beta \rho_i)}{\sum_{j=1}^{k} \exp(\beta \rho_j)}, \qquad \rho_i = \frac{w_i + 1}{n_i + 2}$$

$T = 1/\beta$ is the temperature, which is decreased over time to decrease exploration.

$\epsilon$-greedy when $K$ actions are possible:
- choose the best action with probability $1 - \epsilon + \frac{\epsilon}{K}$
- choose any other action with probability $\frac{\epsilon}{K}$
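Both randomized strategies reduce to a probability vector over arms. The sketch below computes these vectors and draws an arm from them; the function names are illustrative, and subtracting the maximum inside the exponential is a standard numerical-stability step that does not change the probabilities.

```python
import math
import random

def softmax_probs(rho, beta):
    """Boltzmann probabilities P(a = i) proportional to exp(beta * rho_i)."""
    m = max(rho)  # shift by the maximum for numerical stability
    weights = [math.exp(beta * (r - m)) for r in rho]
    z = sum(weights)
    return [w / z for w in weights]

def epsilon_greedy_probs(rho, eps):
    """Best arm gets 1 - eps + eps/K, every other arm gets eps/K."""
    K = len(rho)
    best = max(range(K), key=lambda i: rho[i])
    return [1 - eps + eps / K if i == best else eps / K for i in range(K)]

def sample(probs, rng=random):
    """Draw an arm index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

if __name__ == "__main__":
    rho = [0.5, 0.75, 0.4]
    print(softmax_probs(rho, beta=5.0))
    print(epsilon_greedy_probs(rho, eps=0.1))
    print(sample(softmax_probs(rho, beta=5.0)))
```

A large `beta` (low temperature) concentrates the softmax on the best arm, while `beta = 0` gives uniform exploration.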

Markov Decision Processes

- A set of states $S$, a set of actions $A$, and a reward function $R : S \times A \to \mathbb{R}$.
- A state transition function $T : S \times A \to \Pi(S)$, where $\Pi(S)$ is the set of probability distributions over $S$. We write $T(s' \mid s, a)$.

The model is first-order Markov because the distribution over next states $s'$ depends only on the current state and action $s, a$, and not on any previous history.

We define a policy as $\pi : S \to A$. We define the optimal value of a state as

$$V^*(s) = \max_\pi E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s \right]$$

For the infinite-horizon discounted model, there exists an optimal deterministic stationary policy [Bellman, 1957].

Markov Decision Processes

The optimal value function satisfies

$$V^*(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) V^*(s') \right)$$

and the optimal policy is

$$\pi^*(s) = \arg\max_a \left( R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) V^*(s') \right)$$

Value iteration

Value iteration converges to $V^*$ [Bellman, 1957].

Stopping criterion (Williams & Baird 1993): if $\max_s |V_t(s) - V_{t-1}(s)| = \epsilon$, then $\max_s |V^{\pi_t}(s) - V^{\pi^*}(s)| \le 2\epsilon\gamma/(1 - \gamma)$.

Computational complexity per iteration is $O(|S|^2 |A|)$, or $O(|S||A|)$ when there is a constant number of next states per state (sparse $T$). The number of iterations is polynomial in $1/(1 - \gamma)$.
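Since the algorithm listing itself is not reproduced here, the following is a minimal sketch of value iteration: repeated application of the Bellman backup until successive value functions differ by less than a tolerance. The two-state MDP at the bottom is a hypothetical toy example, and `R[s][a]`, `T[s][a][s2]` are an assumed data layout.

```python
def value_iteration(R, T, gamma, eps=1e-8):
    """Return (V, policy) for reward table R[s][a] and transitions T[s][a][s2]."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states

    def q(s, a, values):
        # one-step lookahead: R(s,a) + gamma * sum_s' T(s'|s,a) V(s')
        return R[s][a] + gamma * sum(T[s][a][s2] * values[s2] for s2 in range(n_states))

    while True:
        V_new = [max(q(s, a, V) for a in range(n_actions)) for s in range(n_states)]
        delta = max(abs(v1 - v0) for v1, v0 in zip(V_new, V))
        V = V_new
        if delta < eps:  # stop when successive value functions are close
            break
    policy = [max(range(n_actions), key=lambda a: q(s, a, V)) for s in range(n_states)]
    return V, policy

# Hypothetical toy MDP: in state 1, action 0 stays and pays 1; everything else pays 0.
R = [[0.0, 0.0],
     [1.0, 0.0]]
T = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
V, policy = value_iteration(R, T, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: $V^*(1) = 1/(1 - 0.9) = 10$ and $V^*(0) = 0.9 \cdot 10 = 9$.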

Policy iteration

Policy iteration manipulates the policy directly, rather than indirectly through the value function.

$V_\pi$ is the value of policy $\pi$. The policy update is greedy with respect to $V_\pi$.

Learning an Optimal Policy: Model-free Methods

Reinforcement learning is primarily concerned with how to obtain the optimal policy when $T(s' \mid s, a)$ and $R(s, a)$ are not known in advance.

Two approaches:
- Model free: learn a controller without learning a model
- Model based: learn a model, and use it to derive a controller

Which is better?
- Model free learns a single task, no generalization, faster, simple tasks
- Model based can be task independent, more complex tasks, slower

Monte Carlo sampling

The simplest method is, for a given policy $\pi$, to run $N$ sample trajectories of length $h$, always starting in state $s$:

$$V_\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{h} \gamma^t r_t^i$$

Repeat for each state $s$.

Inefficient:
- states reappear in multiple sample trajectories
- statistics starting from those states are lost

Adaptive Heuristic Critic and TD(λ)

AHC is an adaptive version of policy iteration [Barto et al., 1983]:
- Critic: computes an estimate of $V_\pi$ for the policy $\pi$ used by the actor (the RL component)
- Actor: optimises $\pi$ based on the (current estimate of) $V_\pi$

NB: only the version with full convergence of the inner loop (the critic, for a fixed policy) can be guaranteed to converge to the optimal policy.

Adaptive Heuristic Critic and TD(λ)

The critic is learned as follows:
- Consider an experience tuple $(s, a, r, s')$ under policy $\pi$:
  $$V(s) := V(s) + \alpha_t \left( r + \gamma V(s') - V(s) \right)$$
- This rule is called TD(0) and converges to the solution of policy evaluation, $V_\pi$.

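At convergence the expected update vanishes. Multiplying by $T(s' \mid s, \pi(s))$ and summing over $s'$ yields the evaluation equation for $V_\pi$:

$$0 = \alpha \left( R(s, \pi(s)) + \gamma \sum_{s'} T(s' \mid s, \pi(s)) V(s') - V(s) \right)$$

The method is known as stochastic approximation, originally due to Robbins and Monro (1951):
- Solve $M(x) = a$ with $M(x) = E[N(x, \xi)]$.
- Iterate $x_{t+1} = x_t + \alpha_t (a - N(x_t, \xi_t))$.
- Convergence requires
  $$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
  For instance $\alpha_t = 1/t$.

Correspondence: $x \leftrightarrow V(s)$, $\xi \leftrightarrow \gamma V(s')$, $a \leftrightarrow R(s, \pi(s))$. Then

$$N(x, \xi) = x - \xi = V(s) - \gamma V(s')$$

$$M(x) = V(s) - \gamma \sum_{s'} T(s' \mid s, \pi(s)) V(s')$$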

TD(λ)

TD(0) converges but makes poor use of the data: only the immediately preceding state is updated. TD(λ) updates every state according to a discount $0 \le \lambda \le 1$:

$$d_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

When $s_1 \to s_2$:

$$V(s_1) := V(s_1) + \alpha_{m_{s_1}} d_1$$

When $s_2 \to s_3$:

$$V(s_1) := V(s_1) + \alpha_{m_{s_1}} \lambda d_2$$

$$V(s_2) := V(s_2) + \alpha_{m_{s_2}} d_2$$

$m_s$ is the number of times state $s$ has been visited.

TD(λ)

In general, at iteration $t$:

$$d_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\epsilon(s) = \sum_{k=1}^{t} \lambda^{t-k} \delta_{s, s_k} \qquad \forall s$$

$$V(s) := V(s) + \alpha_{m_s} d_t\, \epsilon(s) \qquad \forall s$$

States are updated in proportion to their eligibility $\epsilon(s)$, which decays over time.

t   state   ε(s) for states (1, 2, 3)
1   1       (λ^0, 0, 0)
2   2       (λ^1, λ^0, 0)
3   3       (λ^2, λ^1, λ^0)
4   1       (λ^3 + λ^0, λ^2, λ^1)
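The eligibility sum can be maintained incrementally: decay all traces by $\lambda$, then add 1 for the state just visited. The sketch below reproduces the table and applies the TD(λ) update along a trajectory. It assumes a constant step size `alpha` instead of the visit-count dependent $\alpha_{m_s}$ of the slides, uses 0-based state indices, and all names are illustrative.

```python
def eligibility_traces(visited, lam, n_states):
    """Return the trace vector after each visit: e(s) = sum_k lam^(t-k) [s == s_k]."""
    e = [0.0] * n_states
    history = []
    for s in visited:
        e = [lam * x for x in e]  # decay all traces
        e[s] += 1.0               # the current state gets weight lam^0
        history.append(list(e))
    return history

def td_lambda(trajectory, n_states, gamma, lam, alpha):
    """Run TD(lambda) over a list of (s, r, s_next) transitions, starting from V = 0.

    Assumption: constant step size alpha (the slides use a per-state schedule).
    """
    V = [0.0] * n_states
    e = [0.0] * n_states
    for s, r, s_next in trajectory:
        d = r + gamma * V[s_next] - V[s]  # temporal difference d_t
        e = [lam * x for x in e]
        e[s] += 1.0
        V = [v + alpha * d * x for v, x in zip(V, e)]
    return V

if __name__ == "__main__":
    # States 1, 2, 3, 1 of the table correspond to indices 0, 1, 2, 0.
    for row in eligibility_traces([0, 1, 2, 0], lam=0.5, n_states=3):
        print(row)
    print(td_lambda([(0, 0.0, 1), (1, 1.0, 2)], 3, gamma=1.0, lam=0.5, alpha=0.1))
```

With `lam = 0` only the state just left is updated, which recovers TD(0).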

When
- $\alpha_m$ is a decreasing series satisfying the Robbins-Monro criteria (cf. $1/m$), and
- all states are visited infinitely often,

TD(λ) converges to the optimal solution with probability 1 [Bertsekas and Tsitsiklis, 1996].

NB: Error in Kaelbling formula.

Q learning

The two components of AHC can be unified in the Q learning algorithm (Watkins 1989).

Denote by $Q^*(s, a)$ the optimal expected value of state $s$ when taking action $a$ and then proceeding optimally. That is,

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} Q^*(s', a')$$

and $V^*(s) = \max_a Q^*(s, a)$.

Using stochastic approximation, we obtain:
- Generate $s'$ from the environment $T(s' \mid s, a)$.
- Update $Q(s, a) := Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$.
- Generate $a'$ either at random or as $\arg\max_{a'} Q(s', a')$.

Convergence holds under similar criteria as for TD.
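The update rule can be illustrated on a tiny problem. This is a minimal sketch under stated assumptions: actions are chosen uniformly at random (pure exploration), the step size is constant, and `toy_step` is a hypothetical deterministic two-state environment that is not part of the slides.

```python
import random

def q_learning(step, n_states, n_actions, gamma, alpha, n_steps, seed=0):
    """Tabular Q learning with uniformly random action selection.

    `step(s, a)` must return (reward, next_state); the agent never sees T or R directly.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(n_actions)             # explore at random
        r, s_next = step(s, a)                   # sample the environment
        target = r + gamma * max(Q[s_next])      # r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (target - Q[s][a])    # stochastic approximation step
        s = s_next
    return Q

def toy_step(s, a):
    """Hypothetical environment: staying in state 1 (action 0) pays 1, all else pays 0."""
    if s == 0:
        return (0.0, 1) if a == 1 else (0.0, 0)
    return (1.0, 1) if a == 0 else (0.0, 0)

Q = q_learning(toy_step, n_states=2, n_actions=2, gamma=0.9, alpha=0.5, n_steps=5000)
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

Because this toy environment is deterministic, a constant step size is enough for the table to settle; in a stochastic environment a decreasing schedule such as the Robbins-Monro conditions would be needed.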

Model based approaches

Model free methods are data-inefficient.

Simplest model based approach:
- make an arbitrary division between the learning phase and the acting phase
- gather data in some way

Problems:
- random exploration may be dangerous and/or inefficient
- blind to changes in the environment

Dyna

Idea: combine model based and model free (Sutton 1990).

Dyna example

Maze:
- In each of the 46 states there are 4 actions (N, E, S, W), which take the agent to the corresponding neighbouring state. When movement is blocked by an obstacle, no movement results.
- The reward is zero for all states and transitions, except for transitions into the goal state G.
- After reaching the goal state the episode ends and the agent returns to the start state S.
- $\gamma = 0.95$ (discount rate), $\alpha = 0.1$ (learning rate), $\epsilon = 0.1$ (exploration rate).

Dyna example

Left: average over 30 runs of the number of steps per episode. The first episode requires 1700 steps.

Right: policies found by planning ($N = 50$) and non-planning ($N = 0$) Dyna halfway through the second episode.
- $N = 0$ Dyna (normal Q-learning) has only updated the policy for the next-to-goal state.
- $N = 50$ Dyna has learned an environment model from the first episode, which is used to learn policies for all states.

Dyna larger example

A shortest path problem with 3277 states, formulated as a discounted RL problem. The goal state has reward 1; all other states have reward 0.

Dyna (and prioritised sweeping) used $N = 200$ backups per transition.

Generalizations

A shortcoming of the Dyna method is that the planning steps are done at random.
- An improvement can be made by prioritized sweeping (Moore & Atkeson 1993), which updates the states with the highest priority.

Combining Dyna with Monte-Carlo tree search yields state-of-the-art performance on 9 × 9 computer Go [Silver et al., 2012].

Linear function approximation

In the standard treatment of RL, that is, in methods such as value iteration, policy iteration and Q learning discussed in the paper by Kaelbling et al., the basic quantity is the value of a state $V(s)$ and the RL rules update $V(s)$.

In the book of Dayan and Abbott, instead, the update rules for temporal difference learning (Eq , ) use a representation of the state in terms of features:

$$V(s) = \sum_k w_k \phi_k(s)$$

and the RL rules update $w_k$.

The relation can be understood by considering $V(s) = \sum_k w_k \phi_k(s)$ as an approximation to the true value function $V(s)$, which is the solution of the Bellman equation. Consider for example the online version of value iteration. Based on the experience tuple $(s, a, r, s')$, the value $V(s)$ is updated to

$$V^+(s) = V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) = V(s) + \alpha\delta, \qquad \delta = r + \gamma V(s') - V(s)$$

The question is now how the vector $w_k$ is updated to realize the change $V(s) \to V^+(s)$. The answer is to adapt $w_k$ so as to reduce the quadratic difference between $V^+(s)$ and $V(s)$:

$$C = \left( V^+(s) - \sum_k w_k \phi_k(s) \right)^2$$

$$\frac{\partial C}{\partial w_k} = -2 \phi_k(s) \left( V^+(s) - \sum_l \phi_l(s) w_l \right) = -2\alpha \phi_k(s) \delta$$

$$w_k \leftarrow w_k - \beta \frac{\partial C}{\partial w_k} = w_k + \gamma \phi_k(s) \delta$$

with $\gamma = 2\beta\alpha$ (in this last line $\gamma$ denotes a step size, not the discount factor), which is equivalent to the rules (Eq , ). See [Geramifard et al., 2013] for further details.
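The weight update derived above can be written as a single function. This is a minimal sketch, assuming features are given as plain lists; to avoid the symbol clash the step size $2\beta\alpha$ is called `step` and the discount factor `discount`, and the function name is illustrative.

```python
def linear_td_update(w, phi_s, phi_next, r, discount, step):
    """One temporal difference update of the weights of V(s) = sum_k w_k phi_k(s).

    Returns the new weight list and the temporal difference delta.
    """
    v_s = sum(wk * fk for wk, fk in zip(w, phi_s))        # V(s)
    v_next = sum(wk * fk for wk, fk in zip(w, phi_next))  # V(s')
    delta = r + discount * v_next - v_s
    # gradient step on the quadratic difference: w_k <- w_k + step * phi_k(s) * delta
    new_w = [wk + step * fk * delta for wk, fk in zip(w, phi_s)]
    return new_w, delta

if __name__ == "__main__":
    # With one-hot features the rule reduces to the tabular TD(0) update of V(s).
    w, delta = linear_td_update([0.0, 0.0], [1, 0], [0, 1], r=1.0, discount=0.9, step=0.1)
    print(w, delta)
```

With one-hot features only the weight of the visited state changes, so the tabular rule is recovered as a special case.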

References

[Barto et al., 1983] Barto, A., Sutton, R. S., and Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5):

[Bellman, 1957] Bellman, R. (1957). Dynamic programming. Princeton University Press.

[Bertsekas, 2000] Bertsekas, D. (2000). Dynamic Programming and optimal control. Athena Scientific, Belmont, Massachusetts. Second edition.

[Bertsekas and Tsitsiklis, 1996] Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-dynamic programming. Athena Scientific, Belmont, Massachusetts.

[Geramifard et al., 2013] Geramifard, A., Walsh, T., Tellex, S., Chowdhary, G., Roy, N., and How, J. (2013). A tutorial on linear function approximators for dynamic programming and reinforcement learning. Foundations and Trends in Machine Learning, 6:

[Kaelbling et al., 1996] Kaelbling, L., Littman, M., and Moore, A. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:

[Silver et al., 2012] Silver, D., Sutton, R. S., and Müller, M. (2012). Temporal-difference search in computer Go. Machine Learning, 87(2):

Temporal difference learning

AI & Agents for IET
Lecturer: S Luz
http://www.scss.tcd.ie/~luzs/t/cs7032/
February 4, 2014

Recall background & assumptions
- Environment is a finite MDP (i.e. A and S are finite).