1 "It is hard to predict, especially about the future." (Niels Bohr) "You are what you pretend to be, so be careful what you pretend to be." (Kurt Vonnegut)

2 Convergence rate of TD(0) with function approximation
Prashanth L A (Indian Institute of Science), joint work with Nathaniel Korda (MLRG, Oxford University) and Rémi Munos (Google DeepMind). March 27, 2015.

3 Background

4 Background: Markov Decision Processes (MDPs)
An MDP consists of a set of states $\mathcal{X}$, a set of actions $\mathcal{A}$, and rewards $r(x, a)$.
Transition probability: $p(s, a, s') = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$.
(Figure: a transition from state $s_t$ under action $a_t$ to state $s_{t+1}$, yielding reward $r_{t+1}$.)

5 Background: The Controlled Markov Property
For all $i_0, i_1, \ldots, s, s', b_0, b_1, \ldots, a$:
$$P(s_{t+1} = s' \mid s_t = s, a_t = a, \ldots, s_0 = i_0, a_0 = b_0) = p(s, a, s').$$
(Figure: the controlled Markov behaviour along the trajectory $\ldots, s_t, a_t, s_{t+1}, \ldots$.)

6 Background: Value function
$$V^\pi(s) = E\Big[\sum_{t=0}^{\infty} \beta^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big]$$
(value function, with reward $r$ and policy $\pi$).
$V^\pi$ is the fixed point of the Bellman operator $T^\pi$:
$$T^\pi(V)(s) := r(s, \pi(s)) + \beta \sum_{s'} p(s, \pi(s), s')\, V(s').$$

11 Background: Policy evaluation using TD
Temporal difference learning. Problem: estimate the value function of a given policy $\pi$. Solution: use TD(0):
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t\big(r_{t+1} + \beta V_t(s_{t+1}) - V_t(s_t)\big).$$
Why TD(0)? It is simulation based, like Monte Carlo (no model necessary!); it updates a guess based on another guess (like DP); and it is guaranteed to converge to the value function $V^\pi(s)$ under standard assumptions.
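
A minimal sketch of the tabular TD(0) update above. The simulator interface `env.reset()`/`env.step(action)` and the callable `policy(s)` are assumptions made for illustration; `beta` is the discount factor.

```python
import numpy as np

def td0_tabular(env, policy, num_states, beta=0.95, num_steps=10000):
    """Tabular TD(0): estimate V^pi by bootstrapping from the current guess."""
    V = np.zeros(num_states)
    s = env.reset()
    for t in range(1, num_steps + 1):
        alpha = 1.0 / t                      # step-sizes with sum = inf, sum of squares < inf
        s_next, r = env.step(policy(s))      # simulate one transition under pi
        td_error = r + beta * V[s_next] - V[s]
        V[s] += alpha * td_error
        s = s_next
    return V
```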

13 Background: TD with Function Approximation
Linear function approximation: $V^\pi(s) \approx \theta^T\phi(s)$, with parameter $\theta \in \mathbb{R}^d$ and features $\phi(s) \in \mathbb{R}^d$. Note: $d \ll |S|$.
TD fixed point: $\Phi\theta^* = \Pi T^\pi(\Phi\theta^*)$, where $\Phi$ is the feature matrix with rows $\phi(s)^T$, $s \in S$, and $\Pi$ is the orthogonal projection onto $B = \{\Phi\theta \mid \theta \in \mathbb{R}^d\}$.

15 Background: TD(0) with function approximation
$$\theta_{n+1} = \theta_n + \gamma_n\big(r(s_n, \pi(s_n)) + \beta\theta_n^T\phi(s_{n+1}) - \theta_n^T\phi(s_n)\big)\,\phi(s_n)$$
with step-sizes $\gamma_n$; this is a fixed-point iteration. Tsitsiklis and Van Roy (1997) show that $\theta_n \to \theta^*$ a.s., where $A\theta^* = b$, with $A = \Phi^T\Psi(I - \beta P)\Phi$ and $b = \Phi^T\Psi r$.
1 J. N. Tsitsiklis and B. Van Roy (1997). "An analysis of temporal-difference learning with function approximation." IEEE Transactions on Automatic Control.
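
A sketch of the fixed-point iteration above with linear function approximation. The feature map `features(s)` (returning $\phi(s) \in \mathbb{R}^d$) and the simulator interface are assumptions for illustration.

```python
import numpy as np

def td0_linear(env, policy, features, d, beta=0.95, c=1.0, num_steps=10000):
    """TD(0) with linear function approximation: theta_n -> theta*, where A theta* = b."""
    theta = np.zeros(d)
    s = env.reset()
    for n in range(1, num_steps + 1):
        gamma_n = c / (c + n)                          # Robbins-Monro step-size
        s_next, r = env.step(policy(s))
        phi, phi_next = features(s), features(s_next)
        td_error = r + beta * theta.dot(phi_next) - theta.dot(phi)
        theta += gamma_n * td_error * phi
        s = s_next
    return theta
```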

16 Background: Assumptions
- Ergodicity: the Markov chain induced by the policy $\pi$ is irreducible and aperiodic; moreover, there exists a stationary distribution $\Psi (= \Psi^\pi)$ for this Markov chain.
- Linear independence: the feature matrix $\Phi$ has full column rank, and $\lambda_{\min}(\Phi^T\Psi\Phi) \ge \mu > 0$.
- Bounded rewards: $|r(s, \pi(s))| \le 1$ for all $s \in S$.
- Bounded features: $\|\phi(s)\|_2 \le 1$ for all $s \in S$.

17 Background: Assumptions (contd.)
- Step sizes satisfy $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$.
- Bounded mixing time: there exists a non-negative function $B(\cdot)$ such that, for all $s_0 \in S$ and $m \ge 0$,
$$\sum_{\tau=0}^{\infty}\big\|E(\phi(s_\tau) \mid s_0) - E_\Psi(\phi(s_\tau))\big\| \le B(s_0), \qquad \sum_{\tau=0}^{\infty}\big\|E[\phi(s_\tau)\phi(s_{\tau+m})^T \mid s_0] - E_\Psi[\phi(s_\tau)\phi(s_{\tau+m})^T]\big\| \le B(s_0),$$
where $B(\cdot)$ satisfies: for any $q > 1$, there exists a $K_q < \infty$ such that $E[B^q(s) \mid s_0] \le K_q B^q(s_0)$.

18 Concentration bounds: Non-averaged case
"In the long run we are all dead." (John Maynard Keynes)
Question: what happens in a short run of TD(0) with function approximation?

19 Concentration Bounds: Non-averaged TD(0)

20 Non-averaged case: Bound in expectation
Step-size choice: $\gamma_n = \frac{c}{2(c+n)}$, with $(1-\beta)^2\mu c > 1/2$.
Bound in expectation:
$$E\|\theta_n - \theta^*\|_2 \le \frac{K_1(n)}{\sqrt{n+c}}, \quad \text{where} \quad K_1(n) = \frac{2\sqrt{c}\,\|\theta_0 - \theta^*\|_2}{(n+c)^{2(1-\beta)^2\mu c - 1/2}} + \frac{c(1-\beta)(3+6H)B(s_0)}{2(1-\beta)^2\mu c - 1}.$$
Here $H$ is an upper bound on $\|\theta_n\|_2$ for all $n$.

22 Non-averaged case: High probability bound
Step-size choice: $\gamma_n = \frac{c}{2(c+n)}$, with $(\mu(1-\beta)/2 + 3B(s_0))\,c > 1$.
High-probability bound: for any $\delta > 0$,
$$P\Big(\|\theta_n - \theta^*\|_2 \le \frac{K_2(n)}{\sqrt{n+c}}\Big) \ge 1 - \delta, \quad \text{where} \quad K_2(n) := \frac{(1-\beta)c\sqrt{\ln(1/\delta)}\,(1 + 9B(s_0)^2)}{(\mu(1-\beta)/2 + 3B(s_0)^2)c - 1} + K_1(n).$$
Both $K_1(n)$ and $K_2(n)$ above are $O(1)$.

25 Why are these bounds problematic?
Obtaining the optimal rate $O(1/\sqrt{n})$ with a step-size $\gamma_n = c/(c+n)$ requires:
- In expectation: $c$ chosen such that $(1-\beta)^2\mu c \in (1/2, \infty)$.
- In high probability: $c$ satisfying $(\mu(1-\beta)/2 + 3B(s_0))\,c > 1$.
So the optimal rate requires knowledge of the mixing bound $B(s_0)$. Even for finite state-space settings, $B(s_0)$ is a constant, albeit one that depends on the transition dynamics!
Solution: iterate averaging.

27 Proof Outline
Let $z_n = \theta_n - \theta^*$. We first bound the deviation of this error from its mean:
$$P\big(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon\big) \le \exp\Big(\frac{-\epsilon^2}{2\sum_{i=1}^n L_i^2}\Big), \quad \forall \epsilon > 0,$$
and then bound the size of the mean itself:
$$E\|z_n\|_2 \le \Big[\underbrace{\exp(-2(1-\beta)\mu\Gamma_n)\,\|z_0\|_2^2}_{\text{initial error}} + \underbrace{(3+6H)^2 B(s_0)^2\sum_{k=1}^{n-1}\gamma_{k+1}^2\exp\big(-2(1-\beta)\mu(\Gamma_n - \Gamma_{k+1})\big)}_{\text{sampling and mixing error}}\Big]^{1/2},$$
where $\Gamma_n := \sum_{k=1}^n\gamma_k$. Here
$$L_i := \gamma_i\Big[\prod_{j=i+1}^{n}\Big(1 - 2\gamma_j\big(\mu(1-\beta) - \gamma_j\,[1+\beta(3-\beta)]\,B(s_0)^2\big)\Big)\Big]^{1/2}.$$

29 Proof Outline: Bound in Expectation
Let $f_{X_n}(\theta) := [r(s_n, \pi(s_n)) + \beta\theta^T\phi(s_{n+1}) - \theta^T\phi(s_n)]\,\phi(s_n)$. Then the TD update is equivalent to
$$\theta_{n+1} = \theta_n + \gamma_n\big[E_\Psi(f_{X_n}(\theta_n)) + \epsilon_n + M_n\big] \qquad (1)$$
with mixing error $\epsilon_n := E(f_{X_n}(\theta_n) \mid s_0) - E_\Psi(f_{X_n}(\theta_n))$ and martingale-difference sequence $M_n := f_{X_n}(\theta_n) - E(f_{X_n}(\theta_n) \mid s_0)$.
Unrolling (1), we obtain
$$z_{n+1} = (I - \gamma_n A)z_n + \gamma_n(\epsilon_n + M_n) = \Pi_n z_0 + \sum_{k=1}^n\gamma_k\,\Pi_n\Pi_k^{-1}(\epsilon_k + M_k).$$
Here $A := \Phi^T\Psi(I - \beta P)\Phi$ and $\Pi_n := \prod_{k=1}^n(I - \gamma_k A)$.

31 Proof Outline: Bound in Expectation (contd.)
$$z_{n+1} = (I - \gamma_n A)z_n + \gamma_n(\epsilon_n + M_n) = \Pi_n z_0 + \sum_{k=1}^n\gamma_k\,\Pi_n\Pi_k^{-1}(\epsilon_k + M_k).$$
By Jensen's inequality, we obtain
$$E(\|z_n\|_2 \mid s_0) \le \big(E(\langle z_n, z_n\rangle \mid s_0)\big)^{1/2} \le \Big(\|\Pi_n z_0\|_2^2 + 2\sum_{k=1}^n\gamma_k^2\|\Pi_n\Pi_k^{-1}\|^2\,E\big(\|\epsilon_k\|_2^2 \mid s_0\big) + 2\sum_{k=1}^n\gamma_k^2\|\Pi_n\Pi_k^{-1}\|^2\,E\big(\|M_k\|^2 \mid s_0\big)\Big)^{1/2}.$$
The rest of the proof amounts to bounding each of the terms on the RHS above.

32 Proof Outline: High Probability Bound
Recall $z_n = \theta_n - \theta^*$.
Step 1 (Error decomposition):
$$\|z_n\|_2 - E\|z_n\|_2 = \sum_{i=1}^n\big(g_i - E[g_i \mid \mathcal{F}_{i-1}]\big) = \sum_{i=1}^n D_i,$$
where $D_i := g_i - E[g_i \mid \mathcal{F}_{i-1}]$, $g_i := E[\|z_n\|_2 \mid \theta_i]$, and $\mathcal{F}_i = \sigma(\theta_1, \ldots, \theta_i)$.
Step 2 (Lipschitz continuity): the functions $g_i$ are Lipschitz continuous with Lipschitz constants $L_i$.
Step 3 (Concentration inequality): for any $\lambda > 0$,
$$P\big(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon\big) = P\Big(\sum_{i=1}^n D_i \ge \epsilon\Big) \le \exp(-\lambda\epsilon)\exp\Big(\frac{\lambda^2}{2}\sum_{i=1}^n L_i^2\Big),$$
and optimizing over $\lambda$ gives the stated bound.

35 Concentration Bounds: Iterate Averaged TD(0)

36 Polyak-Ruppert averaging: Bound in expectation
Bigger step-size + averaging:
$$\gamma_n := (1-\beta)\Big(\frac{c}{c+n}\Big)^{\alpha}, \qquad \bar\theta_{n+1} := \frac{\theta_1 + \cdots + \theta_n}{n},$$
with $\alpha \in (1/2, 1)$ and $c > 0$.
Bound in expectation:
$$E\|\bar\theta_n - \theta^*\|_2 \le \frac{K_1^A(n)}{(n+c)^{\alpha/2}}, \quad \text{where} \quad K_1^A(n) := \frac{\sqrt{1+9B(s_0)^2}\,\|\theta_0 - \theta^*\|_2}{(n+c)^{(1-\alpha)/2}} + \frac{2\beta(1-\beta)c^{\alpha}HB(s_0)}{\big(\mu c^{\alpha}(1-\beta)^2\big)^{\frac{1+2\alpha}{2(1-\alpha)}}}.$$

38 Iterate averaging: High probability bound
Bigger step-size + averaging: $\gamma_n := (1-\beta)\big(\frac{c}{c+n}\big)^{\alpha}$, $\bar\theta_{n+1} := (\theta_1 + \cdots + \theta_n)/n$.
High-probability bound: for any $\delta > 0$,
$$P\Big(\|\bar\theta_n - \theta^*\|_2 \le \frac{K_2^A(n)}{(n+c)^{\alpha/2}}\Big) \ge 1 - \delta, \quad \text{where} \quad K_2^A(n) := \frac{\sqrt{(1+9B(s_0)^2)\Big[2^{\alpha} + 2\cdot 3^{\alpha}\big(\tfrac{\mu}{1-\beta} + B(s_0)\big)c^{\alpha}\Big]^{\alpha}}}{\tfrac{\mu}{1-\beta} - \tfrac{B(s_0)}{n^{(1-\alpha)/2}}} + K_1^A(n).$$
Here $\alpha$ can be chosen arbitrarily close to $1$, resulting in a rate $O(1/\sqrt{n})$.
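
A sketch of Polyak-Ruppert iterate averaging wrapped around the TD(0) recursion, using the larger step-size $\gamma_n = (1-\beta)(c/(c+n))^\alpha$ from the slides. The simulator and feature interfaces are again assumptions.

```python
import numpy as np

def td0_averaged(env, policy, features, d, beta=0.95, c=1.0, alpha=0.9, num_steps=10000):
    """TD(0) with a bigger step-size and Polyak-Ruppert averaging of the iterates."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)                            # running average of theta_1, ..., theta_n
    s = env.reset()
    for n in range(1, num_steps + 1):
        gamma_n = (1.0 - beta) * (c / (c + n)) ** alpha
        s_next, r = env.step(policy(s))
        phi, phi_next = features(s), features(s_next)
        theta += gamma_n * (r + beta * theta.dot(phi_next) - theta.dot(phi)) * phi
        theta_bar += (theta - theta_bar) / n           # incremental average
        s = s_next
    return theta_bar
```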

41 Proof Outline (iterate averaging)
Let $\bar\theta_{n+1} := (\theta_1 + \cdots + \theta_n)/n$ and $\bar z_n = \bar\theta_{n+1} - \theta^*$. Then
$$P\big(\|\bar z_n\|_2 - E\|\bar z_n\|_2 \ge \epsilon\big) \le \exp\Big(\frac{-\epsilon^2}{2\sum_{i=1}^n L_i^2}\Big), \quad \forall \epsilon > 0, \quad \text{where} \quad L_i := \frac{\gamma_i}{n}\Big(1 + \sum_{l=i+1}^{n-1}\prod_{j=i}^{l}\big(1 - 2\gamma_j(\mu(1-\beta) - \gamma_j\,[1+\beta(3-\beta)]\,B(s_0))\big)\Big).$$
With $\gamma_n = (1-\beta)(c/(c+n))^{\alpha}$, this yields $\sum_{i=1}^n L_i^2 = O(1/n)$, with a constant involving $\big(\tfrac{\mu}{1-\beta} + B(s_0)\big)$, $c^{\alpha}$, and $\big(\tfrac{\mu}{1-\beta} - \tfrac{B(s_0)}{n^{(1-\alpha)/2}}\big)^{-2}$.

43 Proof outline: Bound in expectation (iterate averaging)
To bound the expected error we directly average the errors of the non-averaged iterates:
$$E\|\bar\theta_{n+1} - \theta^*\|_2 \le \frac{1}{n}\sum_{k=1}^n E\|\theta_k - \theta^*\|_2,$$
and then specialise to the step-size choice $\gamma_n = (1-\beta)(c/(c+n))^{\alpha}$:
$$E\|\bar\theta_{n+1} - \theta^*\|_2 \le \sqrt{1+9B(s_0)^2}\,\Big(\frac{\exp\big(-\mu c(n+c)^{1-\alpha}\big)\,\|\theta_0 - \theta^*\|_2}{n} + \frac{2\beta Hc^{\alpha}(1-\beta)}{\big(\mu c^{\alpha}(1-\beta)^2\big)^{\frac{1+2\alpha}{2(1-\alpha)}}\,(n+c)^{\alpha/2}}\Big).$$

44 Centered TD (CTD)

45 Centered TD(0): The Variance Problem
Why does iterate averaging work?
- In TD(0), each iterate introduces a high variance, which must be controlled by the step-size choice.
- Averaging the iterates reduces the variance of the final estimator.
- The reduced variance allows for more exploration within the iterates through larger step sizes.

46 Centered TD(0): A Control Variate Solution
Centering: another approach to variance reduction.
- Instead of averaging iterates, one can use an average to guide the iterates.
- Now all iterates are informed by their history.
- Constructing this average in epochs allows a constant step-size choice.

47 Centering: The Idea
Recall that for TD(0),
$$\theta_{n+1} = \theta_n + \gamma_n\underbrace{\big(r(s_n, \pi(s_n)) + \beta\theta_n^T\phi(s_{n+1}) - \theta_n^T\phi(s_n)\big)\,\phi(s_n)}_{=f_n(\theta_n)}$$
and that $\theta_n \to \theta^*$, the solution of $F(\theta) := \Pi T^\pi(\Phi\theta) - \Phi\theta = 0$.
Centering each iterate:
$$\theta_{n+1} = \theta_n + \gamma\,\underbrace{\big(f_n(\theta_n) - f_n(\tilde\theta_n) + F(\tilde\theta_n)\big)}_{(*)}$$

48 Centering: The Idea (contd.)
Why does centering help?
- No updates after hitting $\theta^*$.
- An average guides the updates in $\theta_{n+1} = \theta_n + \gamma\big(f_n(\theta_n) - f_n(\tilde\theta_n) + F(\tilde\theta_n)\big)$, resulting in low variance of the term $(*)$.
- Allows using a (large) constant step-size.
- $O(d)$ update, the same as TD(0).
- Working with epochs, one needs to store only the averaged iterate $\tilde\theta_n$ and an estimate $\hat F(\tilde\theta_n)$.

49 Centering: The Idea (contd.)
Centered update: $\theta_{n+1} = \theta_n + \gamma\big(f_n(\theta_n) - f_n(\tilde\theta_n) + F(\tilde\theta_n)\big)$.
Challenges compared to gradient descent with an accessible cost function:
- $F$ is unknown and inaccessible in our setting.
- To prove convergence bounds one has to cope with the error due to incomplete mixing.

51 Centered TD(0): The Algorithm
(Diagram: starting from $\tilde\theta^{(m)}$ and $\hat F^{(m)}(\tilde\theta^{(m)})$ (centering), simulate by taking action $\pi(s_n)$ and update $\theta_n$ via the fixed-point iteration (2); at the end of the epoch form $\tilde\theta^{(m+1)}, \hat F^{(m+1)}(\tilde\theta^{(m+1)})$.)
At the beginning of each epoch, an iterate $\tilde\theta^{(m)}$ is chosen uniformly at random from the previous epoch.
Epoch run: set $\theta_{mM} := \tilde\theta^{(m)}$ and, for $n = mM, \ldots, (m+1)M - 1$,
$$\theta_{n+1} = \theta_n + \gamma\Big(f_{X_{i_n}}(\theta_n) - f_{X_{i_n}}(\tilde\theta^{(m)}) + \hat F^{(m)}(\tilde\theta^{(m)})\Big), \quad \text{where} \quad \hat F^{(m)}(\theta) := \frac{1}{M}\sum_{i=(m-1)M}^{mM} f_{X_i}(\theta). \qquad (2)$$
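
A sketch of the CTD epoch structure described above: the centering point is drawn uniformly from the previous epoch's iterates, the batch average $\hat F$ is recomputed there over the previous epoch's samples, and a constant step-size is used inside the epoch. The interface `samples_per_epoch(m)` returning transition tuples `(phi, r, phi_next)` is a hypothetical one introduced for illustration.

```python
import numpy as np

def f_td(sample, theta, beta):
    """TD increment f_X(theta) for one transition sample (phi, r, phi_next)."""
    phi, r, phi_next = sample
    return (r + beta * theta.dot(phi_next) - theta.dot(phi)) * phi

def centered_td(samples_per_epoch, d, beta=0.95, gamma=0.1, num_epochs=50):
    """Centered TD (CTD): epoch-wise variance reduction with a constant step-size."""
    theta = np.zeros(d)
    prev_iterates = [theta.copy()]
    prev_samples = samples_per_epoch(0)
    for m in range(1, num_epochs + 1):
        # pick the centering point uniformly at random from the previous epoch's iterates
        theta_tilde = prev_iterates[np.random.randint(len(prev_iterates))]
        # batch estimate F_hat(theta_tilde) over the previous epoch's samples
        F_hat = np.mean([f_td(x, theta_tilde, beta) for x in prev_samples], axis=0)
        samples, iterates = samples_per_epoch(m), []
        theta = theta_tilde.copy()
        for x in samples:
            theta = theta + gamma * (f_td(x, theta, beta) - f_td(x, theta_tilde, beta) + F_hat)
            iterates.append(theta.copy())
        prev_iterates, prev_samples = iterates, samples
    return theta
```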

53 Centering: Results
Epoch length and step-size choice: choose $M$ and $\gamma$ such that $C_1 < 1$, where
$$C_1 := \frac{1}{2\mu\gamma M\big((1-\beta) - d^2\gamma\big)} + \frac{\gamma d^2}{2\big((1-\beta) - d^2\gamma\big)}.$$
Error bound:
$$\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 \le C_1^m\,\|\Phi(\tilde\theta^{(0)} - \theta^*)\|_\Psi^2 + C_2H(5\gamma + 4)\sum_{k=1}^{m-1}C_1^{(m-2)-k}\,B_{(k-1)M}^{kM}(s_0),$$
where $C_2 = \gamma/\big(2M((1-\beta) - d^2\gamma)\big)$ and $B_{(k-1)M}^{kM}(s_0)$ is an upper bound on the partial sums
$$\Big\|\sum_{i=(k-1)M}^{kM}\big(E(\phi(s_i)\phi(s_{i+l})^T \mid s_0) - E_\Psi(\phi(s_i)\phi(s_{i+l})^T)\big)\Big\|\ (l = 0, 1) \quad \text{and} \quad \Big\|\sum_{i=(k-1)M}^{kM}\big(E(\phi(s_i) \mid s_0) - E_\Psi(\phi(s_i))\big)\Big\|.$$

55 Centering: Results (contd.)
The effect of the mixing error: if the Markov chain underlying policy $\pi$ satisfies
$$\big|P(s_t = s \mid s_0) - \psi(s)\big| \le C\rho^{t/M},$$
then
$$\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 \le C_1^m\,\|\Phi(\tilde\theta^{(0)} - \theta^*)\|_\Psi^2 + CMC_2H(5\gamma + 4)\max\{C_1, \rho^M\}^{(m-1)}.$$
When the MDP mixes exponentially fast (e.g. finite state-space MDPs), we get the exponential convergence rate (only in the first term); otherwise the decay of the error is dominated by the mixing rate.

58 Centered TD(0): Proof Outline
Let $\tilde f_{X_{i_n}}(\theta_n) := f_{X_{i_n}}(\theta_n) - f_{X_{i_n}}(\tilde\theta^{(m)}) + E_\Psi\big(f_{X_{i_n}}(\tilde\theta^{(m)})\big)$.
Step 1 (Rewriting the CTD update):
$$\theta_{n+1} = \theta_n + \gamma\big(\tilde f_{X_{i_n}}(\theta_n) + \epsilon_n\big), \quad \text{where} \quad \epsilon_n := E\big(f_{X_{i_n}}(\tilde\theta^{(m)}) \mid \mathcal{F}_{mM}\big) - E_\Psi\big(f_{X_{i_n}}(\tilde\theta^{(m)})\big).$$
Step 2 (Bounding the variance of the centered updates):
$$E_\Psi\big\|\tilde f_{X_{i_n}}(\theta_n)\big\|_2^2 \le d^2\Big(\|\Phi(\theta_n - \theta^*)\|_\Psi^2 + \|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2\Big).$$

60 Centered TD(0): Proof Outline (contd.)
Step 3 (Analysis for a particular epoch):
$$E_{\theta_n}\|\theta_{n+1} - \theta^*\|_2^2 \le \|\theta_n - \theta^*\|_2^2 + \gamma^2E_{\theta_n}\|\epsilon_n\|_2^2 + 2\gamma(\theta_n - \theta^*)^TE_{\theta_n}\big[\tilde f_{X_{i_n}}(\theta_n)\big] + \gamma^2E_{\theta_n}\big\|\tilde f_{X_{i_n}}(\theta_n)\big\|_2^2$$
$$\le \|\theta_n - \theta^*\|_2^2 - 2\gamma\big((1-\beta) - d^2\gamma\big)\|\Phi(\theta_n - \theta^*)\|_\Psi^2 + \gamma^2d^2\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 + \gamma^2E_{\theta_n}\|\epsilon_n\|_2^2.$$
Summing this inequality over an epoch, and noting that $(\tilde\theta^{(m)} - \theta^*)^TI(\tilde\theta^{(m)} - \theta^*) \le \frac{1}{\mu}(\tilde\theta^{(m)} - \theta^*)^T\Phi^T\Psi\Phi(\tilde\theta^{(m)} - \theta^*)$, we obtain the following by setting $\theta_0 = \tilde\theta^{(m)}$:
$$2\gamma M\big((1-\beta) - d^2\gamma\big)\,\|\Phi(\tilde\theta^{(m+1)} - \theta^*)\|_\Psi^2 \le \Big(\frac{1}{\mu} + \gamma^2Md^2\Big)\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 + \gamma^2\sum_{i=(m-1)M}^{mM}E_{\theta_i}\|\epsilon_i\|_2^2.$$
The final step is to unroll (across epochs) the recursion above to obtain the rate for CTD.

61 TD(0) on a batch

62 Dilbert's boss on big data! (comic)

63 LSTD: A Batch Algorithm
Given a dataset $\mathcal{D} := \{(s_i, r_i, s_i'),\ i = 1, \ldots, T\}$, LSTD approximates the TD fixed point by
$$\hat\theta_T = \bar A_T^{-1}\bar b_T, \quad \text{where} \quad \bar A_T = \frac{1}{T}\sum_{i=1}^T\phi(s_i)\big(\phi(s_i) - \beta\phi(s_i')\big)^T, \qquad \bar b_T = \frac{1}{T}\sum_{i=1}^T r_i\phi(s_i).$$
Complexity: $O(d^2T)$.
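
A sketch of the batch LSTD computation above on a dataset of transitions $(s_i, r_i, s_i')$; the feature map `features` is an assumption for illustration.

```python
import numpy as np

def lstd(dataset, features, d, beta=0.95):
    """Batch LSTD: theta_hat = A_bar^{-1} b_bar, built in O(d^2 T) plus one linear solve."""
    A_bar = np.zeros((d, d))
    b_bar = np.zeros(d)
    T = len(dataset)
    for (s, r, s_next) in dataset:
        phi, phi_next = features(s), features(s_next)
        A_bar += np.outer(phi, phi - beta * phi_next) / T
        b_bar += r * phi / T
    return np.linalg.solve(A_bar, b_bar)
```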

65 Complexity of LSTD [1]
(Figure: LSPI, a batch-mode RL algorithm for control, alternates policy evaluation (policy $\pi$ to Q-value $Q^\pi$) and policy improvement.)
LSTD complexity: $O(d^2T)$ using the Sherman-Morrison lemma, or roughly $O(d^{2.807})$ for the matrix inversion using the Strassen algorithm, or $O(d^{2.376})$ using the Coppersmith-Winograd algorithm.

67 Complexity of LSTD [2]
Problem: practical applications involve high-dimensional features (e.g. Computer Go: $d \approx 10^6$), so solving LSTD is computationally intensive. Related works: GTD, GTD2, iLSTD.
Solution: use stochastic approximation (SA).
- Complexity: $O(dT)$, an $O(d)$ reduction in complexity.
- Theory: the SA variant of LSTD does not impact the overall rate of convergence.
- Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!
1 Sutton et al. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning. In: NIPS.
2 Sutton et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML.
3 Geramifard et al. (2007). iLSTD: Eligibility traces and convergence analysis. In: NIPS.

69 Fast LSTD using Stochastic Approximation
(Loop: from $\theta_n$, pick $i_n$ uniformly in $\{1, \ldots, T\}$ (random sampling) and update $\theta_n$ using $(s_{i_n}, r_{i_n}, s_{i_n}')$ (SA update) to get $\theta_{n+1}$.)
Update rule (a fixed-point iteration with step-sizes $\gamma_n$):
$$\theta_n = \theta_{n-1} + \gamma_n\big(r_{i_n} + \beta\theta_{n-1}^T\phi(s_{i_n}') - \theta_{n-1}^T\phi(s_{i_n})\big)\phi(s_{i_n}).$$
Complexity: $O(d)$ per iteration.
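
A sketch of this stochastic-approximation variant: resample transitions uniformly from the dataset and apply an O(d) TD-style update, using the step-size form from the later convergence-rate slide.

```python
import numpy as np

def fast_lstd_sa(dataset, features, d, beta=0.95, c=1.0, num_iters=100000):
    """flstd-sa: O(d)-per-iteration approximation of the LSTD solution."""
    theta = np.zeros(d)
    T = len(dataset)
    for n in range(1, num_iters + 1):
        gamma_n = (1.0 - beta) * c / (2.0 * (c + n))     # step-size gamma_n = (1-beta)c / (2(c+n))
        s, r, s_next = dataset[np.random.randint(T)]     # uniform random sample from the batch
        phi, phi_next = features(s), features(s_next)
        theta += gamma_n * (r + beta * theta.dot(phi_next) - theta.dot(phi)) * phi
    return theta
```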

71 fast LSTD: Assumptions
Setting: given a dataset $\mathcal{D} := \{(s_i, r_i, s_i'),\ i = 1, \ldots, T\}$.
(A1) $\|\phi(s_i)\|_2 \le 1$ (bounded features).
(A2) $|r_i| \le R_{\max} < \infty$ (bounded rewards).
(A3) $\lambda_{\min}\big(\frac{1}{T}\sum_{i=1}^T\phi(s_i)\phi(s_i)^T\big) \ge \mu$ (the covariance matrix has a minimum eigenvalue).

75 fast LSTD: Convergence Rate
Step-size choice: $\gamma_n = \frac{(1-\beta)c}{2(c+n)}$, with $(1-\beta)^2\mu c \in (1.33, 2)$.
Bound in expectation: $E\|\theta_n - \hat\theta_T\|_2 \le \frac{K_1}{\sqrt{n+c}}$.
High-probability bound: $P\Big(\|\theta_n - \hat\theta_T\|_2 \le \frac{K_2}{\sqrt{n+c}}\Big) \ge 1 - \delta$.
By iterate averaging, the dependency of $c$ on $\mu$ can be removed.

79 fast LSTD: The constants
$$K_1(n) = \frac{\sqrt{c}\,\|\theta_0 - \hat\theta_T\|_2}{n^{((1-\beta)^2\mu c - 1)/2}} + \frac{(1-\beta)c\,h^2(n)}{(1-\beta)^2\mu c - 1}, \qquad K_2(n) = \frac{2(1-\beta)c\sqrt{\log\delta^{-1}}\,h(n)}{(1-\beta)^2\mu c - 1} + K_1(n),$$
where $h(k) := (1 + R_{\max} + \beta)^2\max\big(\|\theta_0 - \hat\theta_T\|_2 + \ln n\,\|\hat\theta_T\|_2,\ 1\big)$.
Both $K_1(n)$ and $K_2(n)$ are $O(1)$.

80 fast LSTD: Iterate Averaging
Bigger step-size + averaging: $\gamma_n := (1-\beta)\big(\frac{c}{c+n}\big)^{\alpha}$, $\bar\theta_{n+1} := (\theta_1 + \cdots + \theta_n)/n$.
Bound in expectation: $E\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K_1^{IA}(n)}{(n+c)^{\alpha/2}}$.
High-probability bound: $P\Big(\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K_2^{IA}(n)}{(n+c)^{\alpha/2}}\Big) \ge 1 - \delta$.
The dependency of $c$ on $\mu$ is removed, at the cost of $(1-\alpha)/2$ in the rate.

84 fast LSTD: The constants (iterate averaging)
$$K_1^{IA}(n) := \frac{C\,\|\theta_0 - \hat\theta_T\|_2}{(n+c)^{(1-\alpha)/2}} + \frac{h(n)c^{\alpha}(1-\beta)}{\big(\mu c^{\alpha}(1-\beta)^2\big)^{\frac{1+2\alpha}{2(1-\alpha)}}}, \qquad K_2^{IA}(n) := \sqrt{\log\delta^{-1}}\Big[\frac{3^{\alpha}}{\mu(1-\beta)} + \frac{2^{\alpha}}{\mu c^{\alpha}(1-\beta)^2} + \frac{2^{\alpha}}{(n+c)^{(1-\alpha)/2}}\Big] + K_1^{IA}(n).$$
As before, both $K_1^{IA}(n)$ and $K_2^{IA}(n)$ are $O(1)$.

85 fast LSTD: Performance bounds
True value function $v$, approximate value function $\tilde v_n := \Phi\theta_n$:
$$\|v - \tilde v_n\|_T \le \underbrace{\frac{\|v - \Pi v\|_T}{\sqrt{1-\beta^2}}}_{\text{approximation error}} + \underbrace{O\Big(\frac{d}{(1-\beta)^2\sqrt{\mu T}}\Big)}_{\text{estimation error}} + \underbrace{O\Big(\frac{1}{(1-\beta)^2\mu^2}\sqrt{\frac{\ln(1/\delta)}{n}}\Big)}_{\text{computational error}}$$
1 $\|f\|_T := \sqrt{\frac{1}{T}\sum_{i=1}^T f(s_i)^2}$ for any function $f$.
2 Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. In: JMLR.

86 fast LSTD: Performance bounds (contd.)
In the bound above, the approximation and estimation errors are artifacts of function approximation and least-squares methods; the computational error is the consequence of using SA for LSTD. Setting $n = \ln(1/\delta)T/(d\mu)$, the convergence rate is unaffected!

89 Fast LSPI using SA: LSPI - A Quick Recap
LSPI alternates policy evaluation (policy $\pi$ to Q-value $Q^\pi$) and policy improvement, with
$$Q^\pi(s, a) = E\Big[\sum_{t=0}^{\infty}\beta^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s, a_0 = a\Big], \qquad \pi'(s) = \arg\max_{a \in \mathcal{A}}\ \theta^T\phi(s, a).$$

91 Policy Evaluation: LSTDQ and its SA variant
Given a set of samples $\mathcal{D} := \{(s_i, a_i, r_i, s_i'),\ i = 1, \ldots, T\}$, LSTDQ approximates $Q^\pi$ by $\hat\theta_T = \bar A_T^{-1}\bar b_T$, where
$$\bar A_T = \frac{1}{T}\sum_{i=1}^T\phi(s_i, a_i)\big(\phi(s_i, a_i) - \beta\phi(s_i', \pi(s_i'))\big)^T, \qquad \bar b_T = \frac{1}{T}\sum_{i=1}^T r_i\phi(s_i, a_i).$$
Fast LSTDQ using SA:
$$\theta_k = \theta_{k-1} + \gamma_k\big(r_{i_k} + \beta\theta_{k-1}^T\phi(s_{i_k}', \pi(s_{i_k}')) - \theta_{k-1}^T\phi(s_{i_k}, a_{i_k})\big)\phi(s_{i_k}, a_{i_k}).$$

93 Fast LSPI using SA (flspi-sa)
Input: sample set $\mathcal{D} := \{(s_i, a_i, r_i, s_i')\}_{i=1}^T$. Repeat:
- Policy evaluation: for $k = 1$ to $\tau$, get a random sample index $i_k \sim U(\{1, \ldots, T\})$ and update the flstd-sa iterate $\theta_k$; set $\theta' \leftarrow \theta_\tau$ and $\Delta = \|\theta' - \theta\|_2$.
- Policy improvement: obtain a greedy policy $\pi'(s) = \arg\max_{a \in \mathcal{A}}\ \theta'^T\phi(s, a)$; set $\theta \leftarrow \theta'$, $\pi \leftarrow \pi'$.
Until $\Delta < \epsilon$. (A sketch in code follows below.)
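
A sketch of the flspi-sa loop: the inner policy-evaluation step reuses the stochastic-approximation update on Q-value features $\phi(s, a)$, followed by greedy improvement. The interfaces `q_features(s, a)` and `actions` are assumptions for illustration.

```python
import numpy as np

def flspi_sa(dataset, q_features, actions, d, beta=0.95, tau=50000, c=1.0, eps=1e-3, max_rounds=20):
    """flspi-sa sketch: LSPI where each LSTDQ solve is replaced by O(d) SA iterations."""
    theta = np.zeros(d)
    for _ in range(max_rounds):
        # greedy policy w.r.t. the current Q-value estimate
        frozen = theta.copy()
        policy = lambda s: max(actions, key=lambda a: frozen.dot(q_features(s, a)))
        # policy evaluation by stochastic approximation (flstd-sa on Q-features)
        theta_new = theta.copy()
        for k in range(1, tau + 1):
            gamma_k = (1.0 - beta) * c / (2.0 * (c + k))
            s, a, r, s_next = dataset[np.random.randint(len(dataset))]
            phi = q_features(s, a)
            phi_next = q_features(s_next, policy(s_next))
            theta_new += gamma_k * (r + beta * theta_new.dot(phi_next) - theta_new.dot(phi)) * phi
        delta = np.linalg.norm(theta_new - theta)
        theta = theta_new
        if delta < eps:   # stop when successive policy-evaluation solutions agree
            break
    return theta
```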

95 Experiments - Traffic Signal Control: The traffic control problem

96 Experiments - Traffic Signal Control: Simulation results on a 7x9-grid network
(Plots: tracking error $\|\theta_k - \hat\theta_T\|_2$ versus step $k$ of flstd-sa, and throughput (TAR) over 5,000 time steps for LSPI and flspi-sa.)

97 Experiments - Traffic Signal Control: Runtime performance on three road networks
(Bar chart: runtime in ms of LSPI versus flspi-sa on a 7x9 grid ($d = 504$), a 14x9 grid ($d = 1008$), and a 14x18 grid ($d = 2016$).)

98 SGD in Linear Bandits

99 Complacs News Recommendation Platform
- NOAM database: 17 million articles from 2010.
- Task: find the best among 2000 news feeds.
- Reward: relevancy score of the article.
- Feature dimension: approx. $10^5$.
1 In collaboration with Nello Cristianini and Tom Welfare at the University of Bristol.

103 More on the relevancy score
Problem: find the best news feed for crime stories. Sample scores:
- "Five dead in Finnish mall shooting": score 1.93
- "Holidays provide more opportunities to drink": score 0.48
- "Russia raises price of vodka": score 2.67
- "Why Obama Care Must Be Defeated": score 0.43
- "University closure due to weather": score 1.06

108 A linear bandit algorithm
Loop: choose $x_n := \arg\max_{x \in \mathcal{D}}\ \mathrm{UCB}(x)$, observe a reward $y_n$ such that $E[y_n \mid x_n] = x_n^T\theta^*$, and re-estimate the UCBs.
Regression is used to compute
$$\mathrm{UCB}(x) := x^T\hat\theta_n + \alpha\sqrt{x^TA_n^{-1}x}.$$
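
A sketch of the linear bandit loop above (LinUCB-style), with the UCB computed from a regularized least-squares estimate. The arm matrix `arms` (rows are feature vectors) and the reward oracle `pull` are assumptions for illustration.

```python
import numpy as np

def lin_ucb(arms, pull, d, alpha=1.0, horizon=1000, reg=1.0):
    """Linear UCB: choose the arm maximizing x^T theta_hat + alpha * sqrt(x^T A^{-1} x)."""
    A = reg * np.eye(d)          # regularized design matrix
    b = np.zeros(d)
    for t in range(horizon):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv.dot(b)
        # confidence widths sqrt(x^T A^{-1} x) for all arms at once
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[np.argmax(arms.dot(theta_hat) + alpha * widths)]
        y = pull(x)              # observe a noisy reward with mean x^T theta*
        A += np.outer(x, x)
        b += y * x
    return theta_hat
```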

112 UCB values
$$\mathrm{UCB}(x) = \hat\mu(x) + \alpha\,\hat\sigma(x)$$
(mean-reward estimate plus confidence width). At each round $t$, select a tap; optimize the quality of the $n$ selected beers.

115 UCB values (contd.)
- Linearity: no need to estimate the mean reward of all arms; estimating $\theta^*$ is enough.
- Regression: $\hat\theta_n = A_n^{-1}b_n$, and $\mathrm{UCB}(x) = \hat\mu(x) + \alpha\,\hat\sigma(x)$, where the confidence width is the Mahalanobis distance of $x$ under $A_n^{-1}$: $\sqrt{x^TA_n^{-1}x}$.
- Optimize the beer you drink, before you get drunk.

118 Performance measure
Best arm: $x^* = \arg\min_{x \in \mathcal{D}}\ x^T\theta^*$. Regret: $R_T = \sum_{i=1}^T(x_i - x^*)^T\theta^*$.
Goal: ensure $R_T$ grows sub-linearly with $T$. Linear bandit algorithms ensure sub-linear regret!

120 Complexity of Least Squares Regression
(Figure: a typical ML algorithm using regression: choose $x_n$, observe $y_n$, estimate $\hat\theta_n$.)
Regression complexity: $O(d^2)$ per update using the Sherman-Morrison lemma, or roughly $O(d^{2.807})$ per solve using the Strassen algorithm, or $O(d^{2.376})$ using the Coppersmith-Winograd algorithm.
Problem: the Complacs news feed platform has high-dimensional features ($d \approx 10^5$), so solving OLS is computationally costly.

122 Fast GD for Regression
(Loop: from $\theta_n$, pick $i_n$ uniformly in $\{1, \ldots, n\}$ (random sampling) and update $\theta_n$ using $(x_{i_n}, y_{i_n})$ (GD update) to get $\theta_{n+1}$.)
Solution: use fast (online) gradient descent (GD).
- Efficient, with a complexity of only $O(d)$ (well known).
- High-probability bounds with explicit constants can be derived (not fully known previously).

123 Bandits + GD for News Recommendation
- LinUCB: a well-known contextual bandit algorithm that employs regression in each iteration.
- Fast GD: provides a good approximation to regression (with low computational cost).
- Strongly-convex bandits: no loss in regret except log factors. Proved!
- Non strongly-convex bandits: encouraging empirical results for LinUCB + fast GD on two news feed platforms.

125 Strongly convex bandits: fast GD
Loop: from $\theta_n$, pick $i_n$ uniformly in $\{1, \ldots, n\}$ (random sampling) and apply the GD update with step-sizes $\gamma_n$ and the sample gradient:
$$\theta_n = \theta_{n-1} + \gamma_n\big(y_{i_n} - \theta_{n-1}^Tx_{i_n}\big)x_{i_n}.$$
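
A sketch of the fast (online) gradient-descent update above for least-squares regression: sample a past observation uniformly and take an O(d) step along the sample gradient, using the step-size form from the error-bound slide.

```python
import numpy as np

def fast_gd_ls(xs, ys, d, c=1.0, num_iters=100000):
    """O(d)-per-iteration SGD approximation to the least-squares solution over (xs, ys)."""
    theta = np.zeros(d)
    n_obs = len(xs)
    for n in range(1, num_iters + 1):
        gamma_n = c / (4.0 * (c + n))        # step-size gamma_n = c / (4(c+n))
        i = np.random.randint(n_obs)         # uniform random sample of a past observation
        theta += gamma_n * (ys[i] - theta.dot(xs[i])) * xs[i]
    return theta
```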

128 Strongly convex bandits: Assumptions
Setting: $y_n = x_n^T\theta^* + \xi_n$, where $\xi_n$ is i.i.d. zero-mean noise.
(A1) $\sup_n\|x_n\|_2 \le 1$ (bounded features).
(A2) $|\xi_n| \le 1$ for all $n$ (bounded noise).
(A3) $\lambda_{\min}\big(\frac{1}{n}\sum_{i=1}^n x_ix_i^T\big) \ge \mu$ (strongly convex covariance matrix, for each $n$!).

132 Why is deriving error bounds difficult?
$$\theta_n - \hat\theta_n = \theta_n - \hat\theta_{n-1} + \hat\theta_{n-1} - \hat\theta_n = \theta_{n-1} - \hat\theta_{n-1} + \hat\theta_{n-1} - \hat\theta_n + \gamma_n\big(y_{i_n} - \theta_{n-1}^Tx_{i_n}\big)x_{i_n}$$
$$= \underbrace{\Pi_n(\theta_0 - \hat\theta_0)}_{\text{initial error}} + \underbrace{\sum_{k=1}^n\gamma_k\Pi_n\Pi_k^{-1}M_k}_{\text{sampling error}} - \underbrace{\sum_{k=1}^n\Pi_n\Pi_k^{-1}(\hat\theta_k - \hat\theta_{k-1})}_{\text{drift error}}.$$
The initial and sampling errors are present in earlier SGD works and can be handled easily; the drift error is a consequence of the changing target and is hard to control!
Note: $\bar A_n = \frac{1}{n}\sum_{i=1}^n x_ix_i^T$, $\Pi_n := \prod_{k=1}^n(I - \gamma_k\bar A_k)$, and $M_k$ is a martingale difference.

135 Handling the Drift Error
Note $F_n(\theta) := \frac{1}{2}\sum_{i=1}^n(y_i - \theta^Tx_i)^2$ and $\bar A_n = \frac{1}{n}\sum_{i=1}^n x_ix_i^T$; also $E[y_n \mid x_n] = x_n^T\theta^*$.
To control the drift error, we observe that
$$\nabla F_n(\hat\theta_n) = 0 = \nabla F_{n-1}(\hat\theta_{n-1}) \implies \big(\hat\theta_{n-1} - \hat\theta_n\big) = \xi_nA_{n-1}^{-1}x_n - \big(x_n^T(\hat\theta_n - \theta^*)\big)A_{n-1}^{-1}x_n.$$
Thus the drift is controlled by the convergence of $\hat\theta_n$ to $\theta^*$. Key: the confidence ball result.
1 Dani, V., Hayes, T. P., and Kakade, S. M. (2008). "Stochastic Linear Optimization under Bandit Feedback." In: COLT.

138 Strongly convex bandits: Error bound
With $\gamma_n = c/(4(c+n))$ and $\mu c/4 \in (2/3, 1)$ we have:
High-probability bound: for any $\delta > 0$,
$$P\Big(\|\theta_n - \hat\theta_n\|_2 \le \frac{K_{\mu,c}}{\sqrt{n}}\sqrt{\log\frac{1}{\delta}} + \frac{h_1(n)}{n}\Big) \ge 1 - \delta.$$
Bound in expectation (optimal rate $O(n^{-1/2})$):
$$E\|\theta_n - \hat\theta_n\|_2 \le \underbrace{\frac{\|\theta_0 - \hat\theta_n\|_2}{n^{\mu c}}}_{\text{initial error}} + \underbrace{\frac{h_2(n)}{\sqrt{n}}}_{\text{sampling error}}.$$
1 $K_{\mu,c}$ is a constant depending on $\mu$ and $c$, and $h_1(n), h_2(n)$ hide log factors.
2 By iterate averaging, the dependency of $c$ on $\mu$ can be removed.

141 PEGE Algorithm
Input: a basis $\{b_1, \ldots, b_d\} \subset \mathcal{D}$ for $\mathbb{R}^d$. For each cycle $m = 1, 2, \ldots$:
- Exploration phase: for $i = 1$ to $d$, choose arm $b_i$ and observe $y_i(m)$; compute the OLS estimate
$$\hat\theta_{md} = \Big(\frac{1}{d}\sum_{i=1}^d b_ib_i^T\Big)^{-1}\frac{1}{d}\sum_{i=1}^d b_i\,\frac{1}{m}\sum_{j=1}^m y_j(i).$$
- Exploitation phase: find $x^* = \arg\min_{x \in \mathcal{D}}\ \hat\theta_{md}^Tx$ and choose arm $x^*$ $m$ times consecutively.
1 Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly Parameterized Bandits. In: Math. Oper. Res.

145 PEGE Algorithm with fast GD
Input: a basis $\{b_1, \ldots, b_d\} \subset \mathcal{D}$ for $\mathbb{R}^d$. For each cycle $m = 1, 2, \ldots$:
- Exploration phase: for $i = 1$ to $d$, choose arm $b_i$, observe $y_i(m)$, and update the fast GD iterate $\theta_{md}$ (instead of computing OLS).
- Exploitation phase: find $x^* = \arg\min_{x \in \mathcal{D}}\ \theta_{md}^Tx$ and choose arm $x^*$ $m$ times consecutively.
(A sketch in code follows below.)

149 Strongly convex bandits: Regret bound for PEGE + fast GD

(A3) (Strongly convex arms) The function G : \theta \mapsto \arg\min_{x \in D} \theta^\top x is J-Lipschitz.

Theorem. Under (A1)-(A3), the regret R_T := \sum_{i=1}^{T} \big( x_i^\top \theta^* - \min_{x \in D} x^\top \theta^* \big) satisfies
  R_T \le C K_1(n)^2 d^{-1} \big( \|\theta^*\|_2 + \|\theta^*\|_2^{-1} \big) \sqrt{T}.

The bound is worse than that for PEGE by only a factor of O(log^4(n)).

Prashanth L A Convergence rate of TD(0) March 27, / 84

150 Non-strongly convex bandits: Fast LinUCB

Rewards y_n satisfy E[y_n | x_n] = x_n^\top \theta.

Loop: choose x_n := \arg\max_{x \in D} UCB(x), observe y_n, and update the fast GD iterate \theta_n, which estimates the least-squares solution \hat\theta_n.

Fast GD is used to compute UCB(x) := x^\top \theta_n + \alpha \sqrt{x^\top \bar\Phi_n^{-1} x}.

Prashanth L A Convergence rate of TD(0) March 27, / 84
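A minimal sketch of this loop is given below, with the confidence width written in the standard LinUCB form sqrt(x^T Phi^{-1} x). The width form, the single GD step per round, and the step size are assumptions made for illustration, not the exact construction from the paper.

import numpy as np

def fast_linucb(D, T, reward_fn, alpha=1.0, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_arms, d = D.shape
    theta = np.zeros(d)                 # GD iterate standing in for the least-squares estimate
    Phi = np.eye(d)                     # regularized design matrix used for the width term
    data, rewards = [], []
    for n in range(1, T + 1):
        # UCB(x) = x^T theta + alpha * sqrt(x^T Phi^{-1} x), evaluated for every arm
        widths = np.sqrt(np.einsum('ij,jk,ik->i', D, np.linalg.inv(Phi), D))
        x = D[np.argmax(D @ theta + alpha * widths)]
        y = reward_fn(x)
        data.append((x, y))
        rewards.append(y)
        Phi += np.outer(x, x)
        # One random-sampling GD step on the squared loss replaces solving for theta_hat
        xi, yi = data[rng.integers(len(data))]
        theta += (step / np.sqrt(n)) * (yi - xi @ theta) * xi
    return theta, rewards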

153 Non-strongly convex bandits: Adaptive regularization

Problem: in many settings, \lambda_{\min}\Big( \frac{1}{n} \sum_{i=1}^{n-1} x_i x_i^\top \Big) \ge \mu may not hold.

Solution: adaptively regularize with \lambda_n:
  \hat\theta_n := \arg\min_\theta \frac{1}{2n} \sum_{i=1}^{n} (y_i - \theta^\top x_i)^2 + \lambda_n \|\theta\|^2.

Random-sampling GD update: pick i_n uniformly in {1, ..., n} and update \theta_n using (x_{i_n}, y_{i_n}):
  \theta_n = \theta_{n-1} + \gamma_n \big( (y_{i_n} - \theta_{n-1}^\top x_{i_n}) x_{i_n} - \lambda_n \theta_{n-1} \big).

Prashanth L A Convergence rate of TD(0) March 27, / 84
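The random-sampling GD update can be transcribed directly into code. The schedules gamma_n and lambda_n below follow the O(n^{-alpha}) / Omega(n^{-(1-alpha)}) choice discussed next, and the toy data generator is purely illustrative.

import numpy as np

def adaptive_reg_gd_step(theta, data, gamma_n, lambda_n, rng):
    # Pick i_n uniformly in {1, ..., n} and take one regularized SGD step:
    # theta_n = theta_{n-1} + gamma_n * ((y_i - theta^T x_i) x_i - lambda_n * theta_{n-1})
    x_i, y_i = data[rng.integers(len(data))]
    return theta + gamma_n * ((y_i - x_i @ theta) * x_i - lambda_n * theta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, alpha = 5, 0.5
    theta_true = rng.normal(size=d)
    theta, data = np.zeros(d), []
    for n in range(1, 2001):
        x = rng.normal(size=d)
        data.append((x, x @ theta_true + 0.1 * rng.normal()))
        theta = adaptive_reg_gd_step(theta, data,
                                     gamma_n=0.5 * n ** (-alpha),
                                     lambda_n=n ** (-(1 - alpha)),
                                     rng=rng)
    print("error vs. theta_true:", np.linalg.norm(theta - theta_true))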

157 Non-strongly convex bandits: Why is deriving error bounds difficult here?

We need to bound
  \theta_n - \hat\theta_n = \underbrace{\Pi_n (\theta_0 - \hat\theta_0)}_{\text{Initial Error}} - \underbrace{\sum_{k=1}^{n} \Pi_n \Pi_k^{-1} (\hat\theta_k - \hat\theta_{k-1})}_{\text{Drift Error}} + \underbrace{\sum_{k=1}^{n} \gamma_k \Pi_n \Pi_k^{-1} M_k}_{\text{Sampling Error}}.   (3)

Need \sum_{k=1}^{n} \gamma_k \lambda_k \to \infty to bound the initial error.

Set \gamma_n = O(n^{-\alpha}) (forcing \lambda_n = \Omega(n^{-(1-\alpha)})).

Bad news: this choice, when plugged into (3), results in only a constant error bound!

Note: \Pi_n := \prod_{k=1}^{n} \big( I - \gamma_k (\bar A_k + \lambda_k I) \big) and \|\hat\theta_{n-1} - \hat\theta_n\| = \Omega(n^{-1}), whenever \alpha \in (0, 1).

Prashanth L A Convergence rate of TD(0) March 27, / 84
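For completeness, here is a sketch of how decomposition (3) arises, assuming the GD recursion can be written as a linear recursion driven by a noise term M_n, with \hat\theta_0 a convention for the initial target (notation otherwise as on the slide).

% One-step recursion for e_n := \theta_n - \hat\theta_n:
%   e_n = (I - \gamma_n(\bar A_n + \lambda_n I)) e_{n-1} + (\hat\theta_{n-1} - \hat\theta_n) + \gamma_n M_n.
% Unrolling, and using \Pi_n \Pi_k^{-1} = \prod_{j=k+1}^{n} (I - \gamma_j(\bar A_j + \lambda_j I)):
\begin{align*}
  \theta_n - \hat\theta_n
    &= \Pi_n(\theta_0 - \hat\theta_0)
       - \sum_{k=1}^{n} \Pi_n \Pi_k^{-1}\,(\hat\theta_k - \hat\theta_{k-1})
       + \sum_{k=1}^{n} \gamma_k\, \Pi_n \Pi_k^{-1}\, M_k .
\end{align*}

When the \bar A_k are positive semi-definite and the step sizes are small, \|\Pi_n\| is at most of the order of \exp(-\sum_{k} \gamma_k \lambda_k), which is why \sum_k \gamma_k \lambda_k must diverge to control the initial error.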

161 News recommendation application Dilbert's boss on news recommendation (and ML) Prashanth L A Convergence rate of TD(0) March 27, / 84

162 News recommendation application Preliminary Results on Complacs News Feed Platform. Figure: cumulative reward vs. iteration for LinUCB and LinUCB-GD. Prashanth L A Convergence rate of TD(0) March 27, / 84

163 News recommendation application Experiments on the Yahoo! dataset [1]. Figure: the Featured tab in the Yahoo! Today module. [1] Yahoo! User-Click Log Dataset, given under the Webscope program (2011). Prashanth L A Convergence rate of TD(0) March 27, / 84

164 News recommendation application Tracking Error
Figure: tracking error \|\theta_n - \hat\theta_n\| vs. iteration n of fLinUCB-GD (SGD), fLinUCB-SVRG (SVRG [1]) and fLinUCB-SAG (SAG [2]).
[1] Johnson, R. and Zhang, T. (2013), Accelerating stochastic gradient descent using predictive variance reduction. NIPS.
[2] Roux, N. L., Schmidt, M. and Bach, F. (2012), A stochastic gradient method with an exponential convergence rate for finite training sets. arXiv preprint.
Prashanth L A Convergence rate of TD(0) March 27, / 84

165 News recommendation application Runtime Performance on two days of the Yahoo! dataset. Figure: runtime (ms) of LinUCB, fLinUCB-GD, fLinUCB-SVRG and fLinUCB-SAG on Day-2 and Day-4. Prashanth L A Convergence rate of TD(0) March 27, / 84

166 For Further Reading I

Nathaniel Korda and Prashanth L.A., On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. arXiv preprint.
Prashanth L.A., Nathaniel Korda and Rémi Munos, Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. ECML.
Nathaniel Korda, Prashanth L.A. and Rémi Munos, Fast gradient descent for least squares regression: Non-asymptotic bounds and application to bandits. AAAI.

Prashanth L A Convergence rate of TD(0) March 27, / 84

167 For Further Reading Dilbert's boss (again) on big data! Prashanth L A Convergence rate of TD(0) March 27, / 84
