1 "It is hard to predict, especially about the future." (Niels Bohr) "You are what you pretend to be, so be careful what you pretend to be." (Kurt Vonnegut)

2 Convergence rate of TD(0) with function approximation
Prashanth L A (Indian Institute of Science), joint work with Nathaniel Korda (MLRG, Oxford University) and Rémi Munos (Google DeepMind). March 27, 2015.

3 Background

4 Background: Markov Decision Processes (MDPs)
An MDP consists of a set of states $\mathcal{X}$, a set of actions $\mathcal{A}$, and rewards $r(x, a)$.
Transition probability: $p(s, a, s') = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$.
(Figure: a transition from state $s_t$ under action $a_t$ to state $s_{t+1}$, yielding reward $r_{t+1}$.)

5 Background: The Controlled Markov Property
For all $i_0, i_1, \ldots, s, s', b_0, b_1, \ldots, a$:
$$P(s_{t+1} = s' \mid s_t = s, a_t = a, \ldots, s_0 = i_0, a_0 = b_0) = p(s, a, s').$$
(Figure: the controlled Markov behaviour along the trajectory $\ldots, s_t, a_t, s_{t+1}, \ldots$.)

6 Background: Value function
$$V^\pi(s) = E\Big[\sum_{t=0}^{\infty} \beta^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big]$$
(value function, with reward $r$ and policy $\pi$).
$V^\pi$ is the fixed point of the Bellman operator $T^\pi$:
$$T^\pi(V)(s) := r(s, \pi(s)) + \beta \sum_{s'} p(s, \pi(s), s')\, V(s').$$

11 Background: Policy evaluation using TD
Temporal difference learning. Problem: estimate the value function of a given policy $\pi$. Solution: use TD(0):
$$V_{t+1}(s_t) = V_t(s_t) + \alpha_t\big(r_{t+1} + \beta V_t(s_{t+1}) - V_t(s_t)\big).$$
Why TD(0)? It is simulation based, like Monte Carlo (no model necessary!); it updates a guess based on another guess (like DP); and it is guaranteed to converge to the value function $V^\pi(s)$ under standard assumptions.
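
A minimal sketch of the tabular TD(0) update above. The simulator interface `env.reset()`/`env.step(action)` and the callable `policy(s)` are assumptions made for illustration; `beta` is the discount factor.

```python
import numpy as np

def td0_tabular(env, policy, num_states, beta=0.95, num_steps=10000):
    """Tabular TD(0): estimate V^pi by bootstrapping from the current guess."""
    V = np.zeros(num_states)
    s = env.reset()
    for t in range(1, num_steps + 1):
        alpha = 1.0 / t                      # step-sizes with sum = inf, sum of squares < inf
        s_next, r = env.step(policy(s))      # simulate one transition under pi
        td_error = r + beta * V[s_next] - V[s]
        V[s] += alpha * td_error
        s = s_next
    return V
```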

13 Background: TD with Function Approximation
Linear function approximation: $V^\pi(s) \approx \theta^T\phi(s)$, with parameter $\theta \in \mathbb{R}^d$ and features $\phi(s) \in \mathbb{R}^d$. Note: $d \ll |S|$.
TD fixed point: $\Phi\theta^* = \Pi T^\pi(\Phi\theta^*)$, where $\Phi$ is the feature matrix with rows $\phi(s)^T$, $s \in S$, and $\Pi$ is the orthogonal projection onto $B = \{\Phi\theta \mid \theta \in \mathbb{R}^d\}$.

15 Background: TD(0) with function approximation
$$\theta_{n+1} = \theta_n + \gamma_n\big(r(s_n, \pi(s_n)) + \beta\theta_n^T\phi(s_{n+1}) - \theta_n^T\phi(s_n)\big)\,\phi(s_n)$$
with step-sizes $\gamma_n$; this is a fixed-point iteration. Tsitsiklis and Van Roy (1997) show that $\theta_n \to \theta^*$ a.s., where $A\theta^* = b$, with $A = \Phi^T\Psi(I - \beta P)\Phi$ and $b = \Phi^T\Psi r$.
1 J. N. Tsitsiklis and B. Van Roy (1997). "An analysis of temporal-difference learning with function approximation." IEEE Transactions on Automatic Control.
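
A sketch of the fixed-point iteration above with linear function approximation. The feature map `features(s)` (returning $\phi(s) \in \mathbb{R}^d$) and the simulator interface are assumptions for illustration.

```python
import numpy as np

def td0_linear(env, policy, features, d, beta=0.95, c=1.0, num_steps=10000):
    """TD(0) with linear function approximation: theta_n -> theta*, where A theta* = b."""
    theta = np.zeros(d)
    s = env.reset()
    for n in range(1, num_steps + 1):
        gamma_n = c / (c + n)                          # Robbins-Monro step-size
        s_next, r = env.step(policy(s))
        phi, phi_next = features(s), features(s_next)
        td_error = r + beta * theta.dot(phi_next) - theta.dot(phi)
        theta += gamma_n * td_error * phi
        s = s_next
    return theta
```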

16 Background: Assumptions
- Ergodicity: the Markov chain induced by the policy $\pi$ is irreducible and aperiodic; moreover, there exists a stationary distribution $\Psi (= \Psi^\pi)$ for this Markov chain.
- Linear independence: the feature matrix $\Phi$ has full column rank, and $\lambda_{\min}(\Phi^T\Psi\Phi) \ge \mu > 0$.
- Bounded rewards: $|r(s, \pi(s))| \le 1$ for all $s \in S$.
- Bounded features: $\|\phi(s)\|_2 \le 1$ for all $s \in S$.

17 Background: Assumptions (contd.)
- Step sizes satisfy $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$.
- Bounded mixing time: there exists a non-negative function $B(\cdot)$ such that, for all $s_0 \in S$ and $m \ge 0$,
$$\sum_{\tau=0}^{\infty}\big\|E(\phi(s_\tau) \mid s_0) - E_\Psi(\phi(s_\tau))\big\| \le B(s_0), \qquad \sum_{\tau=0}^{\infty}\big\|E[\phi(s_\tau)\phi(s_{\tau+m})^T \mid s_0] - E_\Psi[\phi(s_\tau)\phi(s_{\tau+m})^T]\big\| \le B(s_0),$$
where $B(\cdot)$ satisfies: for any $q > 1$, there exists a $K_q < \infty$ such that $E[B^q(s) \mid s_0] \le K_q B^q(s_0)$.

18 Concentration bounds: Non-averaged case
"In the long run we are all dead." (John Maynard Keynes)
Question: what happens in a short run of TD(0) with function approximation?

19 Concentration Bounds: Non-averaged TD(0)

20 Non-averaged case: Bound in expectation
Step-size choice: $\gamma_n = \frac{c}{2(c+n)}$, with $(1-\beta)^2\mu c > 1/2$.
Bound in expectation:
$$E\|\theta_n - \theta^*\|_2 \le \frac{K_1(n)}{\sqrt{n+c}}, \quad \text{where} \quad K_1(n) = \frac{2\sqrt{c}\,\|\theta_0 - \theta^*\|_2}{(n+c)^{2(1-\beta)^2\mu c - 1/2}} + \frac{c(1-\beta)(3+6H)B(s_0)}{2(1-\beta)^2\mu c - 1}.$$
Here $H$ is an upper bound on $\|\theta_n\|_2$ for all $n$.

22 Non-averaged case: High probability bound
Step-size choice: $\gamma_n = \frac{c}{2(c+n)}$, with $(\mu(1-\beta)/2 + 3B(s_0))\,c > 1$.
High-probability bound: for any $\delta > 0$,
$$P\Big(\|\theta_n - \theta^*\|_2 \le \frac{K_2(n)}{\sqrt{n+c}}\Big) \ge 1 - \delta, \quad \text{where} \quad K_2(n) := \frac{(1-\beta)c\sqrt{\ln(1/\delta)}\,(1 + 9B(s_0)^2)}{(\mu(1-\beta)/2 + 3B(s_0)^2)c - 1} + K_1(n).$$
Both $K_1(n)$ and $K_2(n)$ above are $O(1)$.

25 Why are these bounds problematic?
Obtaining the optimal rate $O(1/\sqrt{n})$ with a step-size $\gamma_n = c/(c+n)$ requires:
- In expectation: $c$ chosen such that $(1-\beta)^2\mu c \in (1/2, \infty)$.
- In high probability: $c$ satisfying $(\mu(1-\beta)/2 + 3B(s_0))\,c > 1$.
So the optimal rate requires knowledge of the mixing bound $B(s_0)$. Even for finite state-space settings, $B(s_0)$ is a constant, albeit one that depends on the transition dynamics!
Solution: iterate averaging.

27 Proof Outline
Let $z_n = \theta_n - \theta^*$. We first bound the deviation of this error from its mean:
$$P\big(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon\big) \le \exp\Big(\frac{-\epsilon^2}{2\sum_{i=1}^n L_i^2}\Big), \quad \forall \epsilon > 0,$$
and then bound the size of the mean itself:
$$E\|z_n\|_2 \le \Big[\underbrace{\exp(-2(1-\beta)\mu\Gamma_n)\,\|z_0\|_2^2}_{\text{initial error}} + \underbrace{(3+6H)^2 B(s_0)^2\sum_{k=1}^{n-1}\gamma_{k+1}^2\exp\big(-2(1-\beta)\mu(\Gamma_n - \Gamma_{k+1})\big)}_{\text{sampling and mixing error}}\Big]^{1/2},$$
where $\Gamma_n := \sum_{k=1}^n\gamma_k$. Here
$$L_i := \gamma_i\Big[\prod_{j=i+1}^{n}\Big(1 - 2\gamma_j\big(\mu(1-\beta) - \gamma_j\,[1+\beta(3-\beta)]\,B(s_0)^2\big)\Big)\Big]^{1/2}.$$

29 Proof Outline: Bound in Expectation
Let $f_{X_n}(\theta) := [r(s_n, \pi(s_n)) + \beta\theta^T\phi(s_{n+1}) - \theta^T\phi(s_n)]\,\phi(s_n)$. Then the TD update is equivalent to
$$\theta_{n+1} = \theta_n + \gamma_n\big[E_\Psi(f_{X_n}(\theta_n)) + \epsilon_n + M_n\big] \qquad (1)$$
with mixing error $\epsilon_n := E(f_{X_n}(\theta_n) \mid s_0) - E_\Psi(f_{X_n}(\theta_n))$ and martingale-difference sequence $M_n := f_{X_n}(\theta_n) - E(f_{X_n}(\theta_n) \mid s_0)$.
Unrolling (1), we obtain
$$z_{n+1} = (I - \gamma_n A)z_n + \gamma_n(\epsilon_n + M_n) = \Pi_n z_0 + \sum_{k=1}^n\gamma_k\,\Pi_n\Pi_k^{-1}(\epsilon_k + M_k).$$
Here $A := \Phi^T\Psi(I - \beta P)\Phi$ and $\Pi_n := \prod_{k=1}^n(I - \gamma_k A)$.

31 Proof Outline: Bound in Expectation (contd.)
$$z_{n+1} = (I - \gamma_n A)z_n + \gamma_n(\epsilon_n + M_n) = \Pi_n z_0 + \sum_{k=1}^n\gamma_k\,\Pi_n\Pi_k^{-1}(\epsilon_k + M_k).$$
By Jensen's inequality, we obtain
$$E(\|z_n\|_2 \mid s_0) \le \big(E(\langle z_n, z_n\rangle \mid s_0)\big)^{1/2} \le \Big(\|\Pi_n z_0\|_2^2 + 2\sum_{k=1}^n\gamma_k^2\|\Pi_n\Pi_k^{-1}\|^2\,E\big(\|\epsilon_k\|_2^2 \mid s_0\big) + 2\sum_{k=1}^n\gamma_k^2\|\Pi_n\Pi_k^{-1}\|^2\,E\big(\|M_k\|^2 \mid s_0\big)\Big)^{1/2}.$$
The rest of the proof amounts to bounding each of the terms on the RHS above.

32 Proof Outline: High Probability Bound
Recall $z_n = \theta_n - \theta^*$.
Step 1 (Error decomposition):
$$\|z_n\|_2 - E\|z_n\|_2 = \sum_{i=1}^n\big(g_i - E[g_i \mid \mathcal{F}_{i-1}]\big) = \sum_{i=1}^n D_i,$$
where $D_i := g_i - E[g_i \mid \mathcal{F}_{i-1}]$, $g_i := E[\|z_n\|_2 \mid \theta_i]$, and $\mathcal{F}_i = \sigma(\theta_1, \ldots, \theta_i)$.
Step 2 (Lipschitz continuity): the functions $g_i$ are Lipschitz continuous with Lipschitz constants $L_i$.
Step 3 (Concentration inequality): for any $\lambda > 0$,
$$P\big(\|z_n\|_2 - E\|z_n\|_2 \ge \epsilon\big) = P\Big(\sum_{i=1}^n D_i \ge \epsilon\Big) \le \exp(-\lambda\epsilon)\exp\Big(\frac{\lambda^2}{2}\sum_{i=1}^n L_i^2\Big),$$
and optimizing over $\lambda$ gives the stated bound.

35 Concentration Bounds: Iterate Averaged TD(0)

36 Polyak-Ruppert averaging: Bound in expectation
Bigger step-size + averaging:
$$\gamma_n := (1-\beta)\Big(\frac{c}{c+n}\Big)^{\alpha}, \qquad \bar\theta_{n+1} := \frac{\theta_1 + \cdots + \theta_n}{n},$$
with $\alpha \in (1/2, 1)$ and $c > 0$.
Bound in expectation:
$$E\|\bar\theta_n - \theta^*\|_2 \le \frac{K_1^A(n)}{(n+c)^{\alpha/2}}, \quad \text{where} \quad K_1^A(n) := \frac{\sqrt{1+9B(s_0)^2}\,\|\theta_0 - \theta^*\|_2}{(n+c)^{(1-\alpha)/2}} + \frac{2\beta(1-\beta)c^{\alpha}HB(s_0)}{\big(\mu c^{\alpha}(1-\beta)^2\big)^{\frac{1+2\alpha}{2(1-\alpha)}}}.$$

38 Iterate averaging: High probability bound
Bigger step-size + averaging: $\gamma_n := (1-\beta)\big(\frac{c}{c+n}\big)^{\alpha}$, $\bar\theta_{n+1} := (\theta_1 + \cdots + \theta_n)/n$.
High-probability bound: for any $\delta > 0$,
$$P\Big(\|\bar\theta_n - \theta^*\|_2 \le \frac{K_2^A(n)}{(n+c)^{\alpha/2}}\Big) \ge 1 - \delta, \quad \text{where} \quad K_2^A(n) := \frac{\sqrt{(1+9B(s_0)^2)\Big[2^{\alpha} + 2\cdot 3^{\alpha}\big(\tfrac{\mu}{1-\beta} + B(s_0)\big)c^{\alpha}\Big]^{\alpha}}}{\tfrac{\mu}{1-\beta} - \tfrac{B(s_0)}{n^{(1-\alpha)/2}}} + K_1^A(n).$$
Here $\alpha$ can be chosen arbitrarily close to $1$, resulting in a rate $O(1/\sqrt{n})$.
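
A sketch of Polyak-Ruppert iterate averaging wrapped around the TD(0) recursion, using the larger step-size $\gamma_n = (1-\beta)(c/(c+n))^\alpha$ from the slides. The simulator and feature interfaces are again assumptions.

```python
import numpy as np

def td0_averaged(env, policy, features, d, beta=0.95, c=1.0, alpha=0.9, num_steps=10000):
    """TD(0) with a bigger step-size and Polyak-Ruppert averaging of the iterates."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)                            # running average of theta_1, ..., theta_n
    s = env.reset()
    for n in range(1, num_steps + 1):
        gamma_n = (1.0 - beta) * (c / (c + n)) ** alpha
        s_next, r = env.step(policy(s))
        phi, phi_next = features(s), features(s_next)
        theta += gamma_n * (r + beta * theta.dot(phi_next) - theta.dot(phi)) * phi
        theta_bar += (theta - theta_bar) / n           # incremental average
        s = s_next
    return theta_bar
```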

41 Proof Outline (iterate averaging)
Let $\bar\theta_{n+1} := (\theta_1 + \cdots + \theta_n)/n$ and $\bar z_n = \bar\theta_{n+1} - \theta^*$. Then
$$P\big(\|\bar z_n\|_2 - E\|\bar z_n\|_2 \ge \epsilon\big) \le \exp\Big(\frac{-\epsilon^2}{2\sum_{i=1}^n L_i^2}\Big), \quad \forall \epsilon > 0, \quad \text{where} \quad L_i := \frac{\gamma_i}{n}\Big(1 + \sum_{l=i+1}^{n-1}\prod_{j=i}^{l}\big(1 - 2\gamma_j(\mu(1-\beta) - \gamma_j\,[1+\beta(3-\beta)]\,B(s_0))\big)\Big).$$
With $\gamma_n = (1-\beta)(c/(c+n))^{\alpha}$, this yields $\sum_{i=1}^n L_i^2 = O(1/n)$, with a constant involving $\big(\tfrac{\mu}{1-\beta} + B(s_0)\big)$, $c^{\alpha}$, and $\big(\tfrac{\mu}{1-\beta} - \tfrac{B(s_0)}{n^{(1-\alpha)/2}}\big)^{-2}$.

43 Proof outline: Bound in expectation (iterate averaging)
To bound the expected error we directly average the errors of the non-averaged iterates:
$$E\|\bar\theta_{n+1} - \theta^*\|_2 \le \frac{1}{n}\sum_{k=1}^n E\|\theta_k - \theta^*\|_2,$$
and then specialise to the step-size choice $\gamma_n = (1-\beta)(c/(c+n))^{\alpha}$:
$$E\|\bar\theta_{n+1} - \theta^*\|_2 \le \sqrt{1+9B(s_0)^2}\,\Big(\frac{\exp\big(-\mu c(n+c)^{1-\alpha}\big)\,\|\theta_0 - \theta^*\|_2}{n} + \frac{2\beta Hc^{\alpha}(1-\beta)}{\big(\mu c^{\alpha}(1-\beta)^2\big)^{\frac{1+2\alpha}{2(1-\alpha)}}\,(n+c)^{\alpha/2}}\Big).$$

44 Centered TD (CTD)

45 Centered TD(0): The Variance Problem
Why does iterate averaging work?
- In TD(0), each iterate introduces a high variance, which must be controlled by the step-size choice.
- Averaging the iterates reduces the variance of the final estimator.
- The reduced variance allows for more exploration within the iterates through larger step sizes.

46 Centered TD(0): A Control Variate Solution
Centering: another approach to variance reduction.
- Instead of averaging iterates, one can use an average to guide the iterates.
- Now all iterates are informed by their history.
- Constructing this average in epochs allows a constant step-size choice.

47 Centering: The Idea
Recall that for TD(0),
$$\theta_{n+1} = \theta_n + \gamma_n\underbrace{\big(r(s_n, \pi(s_n)) + \beta\theta_n^T\phi(s_{n+1}) - \theta_n^T\phi(s_n)\big)\,\phi(s_n)}_{=f_n(\theta_n)}$$
and that $\theta_n \to \theta^*$, the solution of $F(\theta) := \Pi T^\pi(\Phi\theta) - \Phi\theta = 0$.
Centering each iterate:
$$\theta_{n+1} = \theta_n + \gamma\,\underbrace{\big(f_n(\theta_n) - f_n(\tilde\theta_n) + F(\tilde\theta_n)\big)}_{(*)}$$

48 Centering: The Idea (contd.)
Why does centering help?
- No updates after hitting $\theta^*$.
- An average guides the updates in $\theta_{n+1} = \theta_n + \gamma\big(f_n(\theta_n) - f_n(\tilde\theta_n) + F(\tilde\theta_n)\big)$, resulting in low variance of the term $(*)$.
- Allows using a (large) constant step-size.
- $O(d)$ update, the same as TD(0).
- Working with epochs, one needs to store only the averaged iterate $\tilde\theta_n$ and an estimate $\hat F(\tilde\theta_n)$.

49 Centering: The Idea (contd.)
Centered update: $\theta_{n+1} = \theta_n + \gamma\big(f_n(\theta_n) - f_n(\tilde\theta_n) + F(\tilde\theta_n)\big)$.
Challenges compared to gradient descent with an accessible cost function:
- $F$ is unknown and inaccessible in our setting.
- To prove convergence bounds one has to cope with the error due to incomplete mixing.

51 Centered TD(0): The Algorithm
(Diagram: starting from $\tilde\theta^{(m)}$ and $\hat F^{(m)}(\tilde\theta^{(m)})$ (centering), simulate by taking action $\pi(s_n)$ and update $\theta_n$ via the fixed-point iteration (2); at the end of the epoch form $\tilde\theta^{(m+1)}, \hat F^{(m+1)}(\tilde\theta^{(m+1)})$.)
At the beginning of each epoch, an iterate $\tilde\theta^{(m)}$ is chosen uniformly at random from the previous epoch.
Epoch run: set $\theta_{mM} := \tilde\theta^{(m)}$ and, for $n = mM, \ldots, (m+1)M - 1$,
$$\theta_{n+1} = \theta_n + \gamma\Big(f_{X_{i_n}}(\theta_n) - f_{X_{i_n}}(\tilde\theta^{(m)}) + \hat F^{(m)}(\tilde\theta^{(m)})\Big), \quad \text{where} \quad \hat F^{(m)}(\theta) := \frac{1}{M}\sum_{i=(m-1)M}^{mM} f_{X_i}(\theta). \qquad (2)$$
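
A sketch of the CTD epoch structure described above: the centering point is drawn uniformly from the previous epoch's iterates, the batch average $\hat F$ is recomputed there over the previous epoch's samples, and a constant step-size is used inside the epoch. The interface `samples_per_epoch(m)` returning transition tuples `(phi, r, phi_next)` is a hypothetical one introduced for illustration.

```python
import numpy as np

def f_td(sample, theta, beta):
    """TD increment f_X(theta) for one transition sample (phi, r, phi_next)."""
    phi, r, phi_next = sample
    return (r + beta * theta.dot(phi_next) - theta.dot(phi)) * phi

def centered_td(samples_per_epoch, d, beta=0.95, gamma=0.1, num_epochs=50):
    """Centered TD (CTD): epoch-wise variance reduction with a constant step-size."""
    theta = np.zeros(d)
    prev_iterates = [theta.copy()]
    prev_samples = samples_per_epoch(0)
    for m in range(1, num_epochs + 1):
        # pick the centering point uniformly at random from the previous epoch's iterates
        theta_tilde = prev_iterates[np.random.randint(len(prev_iterates))]
        # batch estimate F_hat(theta_tilde) over the previous epoch's samples
        F_hat = np.mean([f_td(x, theta_tilde, beta) for x in prev_samples], axis=0)
        samples, iterates = samples_per_epoch(m), []
        theta = theta_tilde.copy()
        for x in samples:
            theta = theta + gamma * (f_td(x, theta, beta) - f_td(x, theta_tilde, beta) + F_hat)
            iterates.append(theta.copy())
        prev_iterates, prev_samples = iterates, samples
    return theta
```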

53 Centering: Results
Epoch length and step-size choice: choose $M$ and $\gamma$ such that $C_1 < 1$, where
$$C_1 := \frac{1}{2\mu\gamma M\big((1-\beta) - d^2\gamma\big)} + \frac{\gamma d^2}{2\big((1-\beta) - d^2\gamma\big)}.$$
Error bound:
$$\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 \le C_1^m\,\|\Phi(\tilde\theta^{(0)} - \theta^*)\|_\Psi^2 + C_2H(5\gamma + 4)\sum_{k=1}^{m-1}C_1^{(m-2)-k}\,B_{(k-1)M}^{kM}(s_0),$$
where $C_2 = \gamma/\big(2M((1-\beta) - d^2\gamma)\big)$ and $B_{(k-1)M}^{kM}(s_0)$ is an upper bound on the partial sums
$$\Big\|\sum_{i=(k-1)M}^{kM}\big(E(\phi(s_i)\phi(s_{i+l})^T \mid s_0) - E_\Psi(\phi(s_i)\phi(s_{i+l})^T)\big)\Big\|\ (l = 0, 1) \quad \text{and} \quad \Big\|\sum_{i=(k-1)M}^{kM}\big(E(\phi(s_i) \mid s_0) - E_\Psi(\phi(s_i))\big)\Big\|.$$

55 Centering: Results (contd.)
The effect of the mixing error: if the Markov chain underlying policy $\pi$ satisfies
$$\big|P(s_t = s \mid s_0) - \psi(s)\big| \le C\rho^{t/M},$$
then
$$\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 \le C_1^m\,\|\Phi(\tilde\theta^{(0)} - \theta^*)\|_\Psi^2 + CMC_2H(5\gamma + 4)\max\{C_1, \rho^M\}^{(m-1)}.$$
When the MDP mixes exponentially fast (e.g. finite state-space MDPs), we get the exponential convergence rate (only in the first term); otherwise the decay of the error is dominated by the mixing rate.

58 Centered TD(0): Proof Outline
Let $\tilde f_{X_{i_n}}(\theta_n) := f_{X_{i_n}}(\theta_n) - f_{X_{i_n}}(\tilde\theta^{(m)}) + E_\Psi\big(f_{X_{i_n}}(\tilde\theta^{(m)})\big)$.
Step 1 (Rewriting the CTD update):
$$\theta_{n+1} = \theta_n + \gamma\big(\tilde f_{X_{i_n}}(\theta_n) + \epsilon_n\big), \quad \text{where} \quad \epsilon_n := E\big(f_{X_{i_n}}(\tilde\theta^{(m)}) \mid \mathcal{F}_{mM}\big) - E_\Psi\big(f_{X_{i_n}}(\tilde\theta^{(m)})\big).$$
Step 2 (Bounding the variance of the centered updates):
$$E_\Psi\big\|\tilde f_{X_{i_n}}(\theta_n)\big\|_2^2 \le d^2\Big(\|\Phi(\theta_n - \theta^*)\|_\Psi^2 + \|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2\Big).$$

60 Centered TD(0): Proof Outline (contd.)
Step 3 (Analysis for a particular epoch):
$$E_{\theta_n}\|\theta_{n+1} - \theta^*\|_2^2 \le \|\theta_n - \theta^*\|_2^2 + \gamma^2E_{\theta_n}\|\epsilon_n\|_2^2 + 2\gamma(\theta_n - \theta^*)^TE_{\theta_n}\big[\tilde f_{X_{i_n}}(\theta_n)\big] + \gamma^2E_{\theta_n}\big\|\tilde f_{X_{i_n}}(\theta_n)\big\|_2^2$$
$$\le \|\theta_n - \theta^*\|_2^2 - 2\gamma\big((1-\beta) - d^2\gamma\big)\|\Phi(\theta_n - \theta^*)\|_\Psi^2 + \gamma^2d^2\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 + \gamma^2E_{\theta_n}\|\epsilon_n\|_2^2.$$
Summing this inequality over an epoch, and noting that $(\tilde\theta^{(m)} - \theta^*)^TI(\tilde\theta^{(m)} - \theta^*) \le \frac{1}{\mu}(\tilde\theta^{(m)} - \theta^*)^T\Phi^T\Psi\Phi(\tilde\theta^{(m)} - \theta^*)$, we obtain the following by setting $\theta_0 = \tilde\theta^{(m)}$:
$$2\gamma M\big((1-\beta) - d^2\gamma\big)\,\|\Phi(\tilde\theta^{(m+1)} - \theta^*)\|_\Psi^2 \le \Big(\frac{1}{\mu} + \gamma^2Md^2\Big)\|\Phi(\tilde\theta^{(m)} - \theta^*)\|_\Psi^2 + \gamma^2\sum_{i=(m-1)M}^{mM}E_{\theta_i}\|\epsilon_i\|_2^2.$$
The final step is to unroll (across epochs) the recursion above to obtain the rate for CTD.

61 TD(0) on a batch

62 Dilbert's boss on big data! (comic)

63 LSTD: A Batch Algorithm
Given a dataset $\mathcal{D} := \{(s_i, r_i, s_i'),\ i = 1, \ldots, T\}$, LSTD approximates the TD fixed point by
$$\hat\theta_T = \bar A_T^{-1}\bar b_T, \quad \text{where} \quad \bar A_T = \frac{1}{T}\sum_{i=1}^T\phi(s_i)\big(\phi(s_i) - \beta\phi(s_i')\big)^T, \qquad \bar b_T = \frac{1}{T}\sum_{i=1}^T r_i\phi(s_i).$$
Complexity: $O(d^2T)$.
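
A sketch of the batch LSTD computation above on a dataset of transitions $(s_i, r_i, s_i')$; the feature map `features` is an assumption for illustration.

```python
import numpy as np

def lstd(dataset, features, d, beta=0.95):
    """Batch LSTD: theta_hat = A_bar^{-1} b_bar, built in O(d^2 T) plus one linear solve."""
    A_bar = np.zeros((d, d))
    b_bar = np.zeros(d)
    T = len(dataset)
    for (s, r, s_next) in dataset:
        phi, phi_next = features(s), features(s_next)
        A_bar += np.outer(phi, phi - beta * phi_next) / T
        b_bar += r * phi / T
    return np.linalg.solve(A_bar, b_bar)
```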

65 Complexity of LSTD [1]
(Figure: LSPI, a batch-mode RL algorithm for control, alternates policy evaluation (policy $\pi$ to Q-value $Q^\pi$) and policy improvement.)
LSTD complexity: $O(d^2T)$ using the Sherman-Morrison lemma, or roughly $O(d^{2.807})$ for the matrix inversion using the Strassen algorithm, or $O(d^{2.376})$ using the Coppersmith-Winograd algorithm.

67 Complexity of LSTD [2]
Problem: practical applications involve high-dimensional features (e.g. Computer Go: $d \approx 10^6$), so solving LSTD is computationally intensive. Related works: GTD, GTD2, iLSTD.
Solution: use stochastic approximation (SA).
- Complexity: $O(dT)$, an $O(d)$ reduction in complexity.
- Theory: the SA variant of LSTD does not impact the overall rate of convergence.
- Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!
1 Sutton et al. (2009). A convergent O(n) algorithm for off-policy temporal-difference learning. In: NIPS.
2 Sutton et al. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML.
3 Geramifard et al. (2007). iLSTD: Eligibility traces and convergence analysis. In: NIPS.

69 Fast LSTD using Stochastic Approximation
(Loop: from $\theta_n$, pick $i_n$ uniformly in $\{1, \ldots, T\}$ (random sampling) and update $\theta_n$ using $(s_{i_n}, r_{i_n}, s_{i_n}')$ (SA update) to get $\theta_{n+1}$.)
Update rule (a fixed-point iteration with step-sizes $\gamma_n$):
$$\theta_n = \theta_{n-1} + \gamma_n\big(r_{i_n} + \beta\theta_{n-1}^T\phi(s_{i_n}') - \theta_{n-1}^T\phi(s_{i_n})\big)\phi(s_{i_n}).$$
Complexity: $O(d)$ per iteration.
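
A sketch of this stochastic-approximation variant: resample transitions uniformly from the dataset and apply an O(d) TD-style update, using the step-size form from the later convergence-rate slide.

```python
import numpy as np

def fast_lstd_sa(dataset, features, d, beta=0.95, c=1.0, num_iters=100000):
    """flstd-sa: O(d)-per-iteration approximation of the LSTD solution."""
    theta = np.zeros(d)
    T = len(dataset)
    for n in range(1, num_iters + 1):
        gamma_n = (1.0 - beta) * c / (2.0 * (c + n))     # step-size gamma_n = (1-beta)c / (2(c+n))
        s, r, s_next = dataset[np.random.randint(T)]     # uniform random sample from the batch
        phi, phi_next = features(s), features(s_next)
        theta += gamma_n * (r + beta * theta.dot(phi_next) - theta.dot(phi)) * phi
    return theta
```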

71 fast LSTD: Assumptions
Setting: given a dataset $\mathcal{D} := \{(s_i, r_i, s_i'),\ i = 1, \ldots, T\}$.
(A1) $\|\phi(s_i)\|_2 \le 1$ (bounded features).
(A2) $|r_i| \le R_{\max} < \infty$ (bounded rewards).
(A3) $\lambda_{\min}\big(\frac{1}{T}\sum_{i=1}^T\phi(s_i)\phi(s_i)^T\big) \ge \mu$ (the covariance matrix has a minimum eigenvalue).

75 fast LSTD: Convergence Rate
Step-size choice: $\gamma_n = \frac{(1-\beta)c}{2(c+n)}$, with $(1-\beta)^2\mu c \in (1.33, 2)$.
Bound in expectation: $E\|\theta_n - \hat\theta_T\|_2 \le \frac{K_1}{\sqrt{n+c}}$.
High-probability bound: $P\Big(\|\theta_n - \hat\theta_T\|_2 \le \frac{K_2}{\sqrt{n+c}}\Big) \ge 1 - \delta$.
By iterate averaging, the dependency of $c$ on $\mu$ can be removed.

79 fast LSTD: The constants
$$K_1(n) = \frac{\sqrt{c}\,\|\theta_0 - \hat\theta_T\|_2}{n^{((1-\beta)^2\mu c - 1)/2}} + \frac{(1-\beta)c\,h^2(n)}{(1-\beta)^2\mu c - 1}, \qquad K_2(n) = \frac{2(1-\beta)c\sqrt{\log\delta^{-1}}\,h(n)}{(1-\beta)^2\mu c - 1} + K_1(n),$$
where $h(k) := (1 + R_{\max} + \beta)^2\max\big(\|\theta_0 - \hat\theta_T\|_2 + \ln n\,\|\hat\theta_T\|_2,\ 1\big)$.
Both $K_1(n)$ and $K_2(n)$ are $O(1)$.

80 fast LSTD: Iterate Averaging
Bigger step-size + averaging: $\gamma_n := (1-\beta)\big(\frac{c}{c+n}\big)^{\alpha}$, $\bar\theta_{n+1} := (\theta_1 + \cdots + \theta_n)/n$.
Bound in expectation: $E\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K_1^{IA}(n)}{(n+c)^{\alpha/2}}$.
High-probability bound: $P\Big(\|\bar\theta_n - \hat\theta_T\|_2 \le \frac{K_2^{IA}(n)}{(n+c)^{\alpha/2}}\Big) \ge 1 - \delta$.
The dependency of $c$ on $\mu$ is removed, at the cost of $(1-\alpha)/2$ in the rate.

84 fast LSTD: The constants (iterate averaging)
$$K_1^{IA}(n) := \frac{C\,\|\theta_0 - \hat\theta_T\|_2}{(n+c)^{(1-\alpha)/2}} + \frac{h(n)c^{\alpha}(1-\beta)}{\big(\mu c^{\alpha}(1-\beta)^2\big)^{\frac{1+2\alpha}{2(1-\alpha)}}}, \qquad K_2^{IA}(n) := \sqrt{\log\delta^{-1}}\Big[\frac{3^{\alpha}}{\mu(1-\beta)} + \frac{2^{\alpha}}{\mu c^{\alpha}(1-\beta)^2} + \frac{2^{\alpha}}{(n+c)^{(1-\alpha)/2}}\Big] + K_1^{IA}(n).$$
As before, both $K_1^{IA}(n)$ and $K_2^{IA}(n)$ are $O(1)$.

85 fast LSTD: Performance bounds
True value function $v$, approximate value function $\tilde v_n := \Phi\theta_n$:
$$\|v - \tilde v_n\|_T \le \underbrace{\frac{\|v - \Pi v\|_T}{\sqrt{1-\beta^2}}}_{\text{approximation error}} + \underbrace{O\Big(\frac{d}{(1-\beta)^2\sqrt{\mu T}}\Big)}_{\text{estimation error}} + \underbrace{O\Big(\frac{1}{(1-\beta)^2\mu^2}\sqrt{\frac{\ln(1/\delta)}{n}}\Big)}_{\text{computational error}}$$
1 $\|f\|_T := \sqrt{\frac{1}{T}\sum_{i=1}^T f(s_i)^2}$ for any function $f$.
2 Lazaric, A., Ghavamzadeh, M., and Munos, R. (2012). Finite-sample analysis of least-squares policy iteration. In: JMLR.

86 fast LSTD: Performance bounds (contd.)
In the bound above, the approximation and estimation errors are artifacts of function approximation and least-squares methods; the computational error is the consequence of using SA for LSTD. Setting $n = \ln(1/\delta)T/(d\mu)$, the convergence rate is unaffected!

89 Fast LSPI using SA: LSPI - A Quick Recap
LSPI alternates policy evaluation (policy $\pi$ to Q-value $Q^\pi$) and policy improvement, with
$$Q^\pi(s, a) = E\Big[\sum_{t=0}^{\infty}\beta^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s, a_0 = a\Big], \qquad \pi'(s) = \arg\max_{a \in \mathcal{A}}\ \theta^T\phi(s, a).$$

91 Policy Evaluation: LSTDQ and its SA variant
Given a set of samples $\mathcal{D} := \{(s_i, a_i, r_i, s_i'),\ i = 1, \ldots, T\}$, LSTDQ approximates $Q^\pi$ by $\hat\theta_T = \bar A_T^{-1}\bar b_T$, where
$$\bar A_T = \frac{1}{T}\sum_{i=1}^T\phi(s_i, a_i)\big(\phi(s_i, a_i) - \beta\phi(s_i', \pi(s_i'))\big)^T, \qquad \bar b_T = \frac{1}{T}\sum_{i=1}^T r_i\phi(s_i, a_i).$$
Fast LSTDQ using SA:
$$\theta_k = \theta_{k-1} + \gamma_k\big(r_{i_k} + \beta\theta_{k-1}^T\phi(s_{i_k}', \pi(s_{i_k}')) - \theta_{k-1}^T\phi(s_{i_k}, a_{i_k})\big)\phi(s_{i_k}, a_{i_k}).$$

93 Fast LSPI using SA (flspi-sa)
Input: sample set $\mathcal{D} := \{(s_i, a_i, r_i, s_i')\}_{i=1}^T$. Repeat:
- Policy evaluation: for $k = 1$ to $\tau$, get a random sample index $i_k \sim U(\{1, \ldots, T\})$ and update the flstd-sa iterate $\theta_k$; set $\theta' \leftarrow \theta_\tau$ and $\Delta = \|\theta' - \theta\|_2$.
- Policy improvement: obtain a greedy policy $\pi'(s) = \arg\max_{a \in \mathcal{A}}\ \theta'^T\phi(s, a)$; set $\theta \leftarrow \theta'$, $\pi \leftarrow \pi'$.
Until $\Delta < \epsilon$. (A sketch in code follows below.)
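
A sketch of the flspi-sa loop: the inner policy-evaluation step reuses the stochastic-approximation update on Q-value features $\phi(s, a)$, followed by greedy improvement. The interfaces `q_features(s, a)` and `actions` are assumptions for illustration.

```python
import numpy as np

def flspi_sa(dataset, q_features, actions, d, beta=0.95, tau=50000, c=1.0, eps=1e-3, max_rounds=20):
    """flspi-sa sketch: LSPI where each LSTDQ solve is replaced by O(d) SA iterations."""
    theta = np.zeros(d)
    for _ in range(max_rounds):
        # greedy policy w.r.t. the current Q-value estimate
        frozen = theta.copy()
        policy = lambda s: max(actions, key=lambda a: frozen.dot(q_features(s, a)))
        # policy evaluation by stochastic approximation (flstd-sa on Q-features)
        theta_new = theta.copy()
        for k in range(1, tau + 1):
            gamma_k = (1.0 - beta) * c / (2.0 * (c + k))
            s, a, r, s_next = dataset[np.random.randint(len(dataset))]
            phi = q_features(s, a)
            phi_next = q_features(s_next, policy(s_next))
            theta_new += gamma_k * (r + beta * theta_new.dot(phi_next) - theta_new.dot(phi)) * phi
        delta = np.linalg.norm(theta_new - theta)
        theta = theta_new
        if delta < eps:   # stop when successive policy-evaluation solutions agree
            break
    return theta
```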

95 Experiments - Traffic Signal Control: The traffic control problem

96 Experiments - Traffic Signal Control: Simulation results on a 7x9-grid network
(Plots: tracking error $\|\theta_k - \hat\theta_T\|_2$ versus step $k$ of flstd-sa, and throughput (TAR) over 5,000 time steps for LSPI and flspi-sa.)

97 Experiments - Traffic Signal Control: Runtime performance on three road networks
(Bar chart: runtime in ms of LSPI versus flspi-sa on a 7x9 grid ($d = 504$), a 14x9 grid ($d = 1008$), and a 14x18 grid ($d = 2016$).)

98 SGD in Linear Bandits

99 Complacs News Recommendation Platform
- NOAM database: 17 million articles from 2010.
- Task: find the best among 2000 news feeds.
- Reward: relevancy score of the article.
- Feature dimension: approx. $10^5$.
1 In collaboration with Nello Cristianini and Tom Welfare at the University of Bristol.

103 More on the relevancy score
Problem: find the best news feed for crime stories. Sample scores:
- "Five dead in Finnish mall shooting": score 1.93
- "Holidays provide more opportunities to drink": score 0.48
- "Russia raises price of vodka": score 2.67
- "Why Obama Care Must Be Defeated": score 0.43
- "University closure due to weather": score 1.06

108 A linear bandit algorithm
Loop: choose $x_n := \arg\max_{x \in \mathcal{D}}\ \mathrm{UCB}(x)$, observe a reward $y_n$ such that $E[y_n \mid x_n] = x_n^T\theta^*$, and re-estimate the UCBs.
Regression is used to compute
$$\mathrm{UCB}(x) := x^T\hat\theta_n + \alpha\sqrt{x^TA_n^{-1}x}.$$
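
A sketch of the linear bandit loop above (LinUCB-style), with the UCB computed from a regularized least-squares estimate. The arm matrix `arms` (rows are feature vectors) and the reward oracle `pull` are assumptions for illustration.

```python
import numpy as np

def lin_ucb(arms, pull, d, alpha=1.0, horizon=1000, reg=1.0):
    """Linear UCB: choose the arm maximizing x^T theta_hat + alpha * sqrt(x^T A^{-1} x)."""
    A = reg * np.eye(d)          # regularized design matrix
    b = np.zeros(d)
    for t in range(horizon):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv.dot(b)
        # confidence widths sqrt(x^T A^{-1} x) for all arms at once
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[np.argmax(arms.dot(theta_hat) + alpha * widths)]
        y = pull(x)              # observe a noisy reward with mean x^T theta*
        A += np.outer(x, x)
        b += y * x
    return theta_hat
```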

112 UCB values
$$\mathrm{UCB}(x) = \hat\mu(x) + \alpha\,\hat\sigma(x)$$
(mean-reward estimate plus confidence width). At each round $t$, select a tap; optimize the quality of the $n$ selected beers.

115 UCB values (contd.)
- Linearity: no need to estimate the mean reward of all arms; estimating $\theta^*$ is enough.
- Regression: $\hat\theta_n = A_n^{-1}b_n$, and $\mathrm{UCB}(x) = \hat\mu(x) + \alpha\,\hat\sigma(x)$, where the confidence width is the Mahalanobis distance of $x$ under $A_n^{-1}$: $\sqrt{x^TA_n^{-1}x}$.
- Optimize the beer you drink, before you get drunk.

118 Performance measure
Best arm: $x^* = \arg\min_{x \in \mathcal{D}}\ x^T\theta^*$. Regret: $R_T = \sum_{i=1}^T(x_i - x^*)^T\theta^*$.
Goal: ensure $R_T$ grows sub-linearly with $T$. Linear bandit algorithms ensure sub-linear regret!

120 Complexity of Least Squares Regression
(Figure: a typical ML algorithm using regression: choose $x_n$, observe $y_n$, estimate $\hat\theta_n$.)
Regression complexity: $O(d^2)$ per update using the Sherman-Morrison lemma, or roughly $O(d^{2.807})$ per solve using the Strassen algorithm, or $O(d^{2.376})$ using the Coppersmith-Winograd algorithm.
Problem: the Complacs news feed platform has high-dimensional features ($d \approx 10^5$), so solving OLS is computationally costly.

122 Fast GD for Regression
(Loop: from $\theta_n$, pick $i_n$ uniformly in $\{1, \ldots, n\}$ (random sampling) and update $\theta_n$ using $(x_{i_n}, y_{i_n})$ (GD update) to get $\theta_{n+1}$.)
Solution: use fast (online) gradient descent (GD).
- Efficient, with a complexity of only $O(d)$ (well known).
- High-probability bounds with explicit constants can be derived (not fully known previously).

123 Bandits + GD for News Recommendation
- LinUCB: a well-known contextual bandit algorithm that employs regression in each iteration.
- Fast GD: provides a good approximation to regression (with low computational cost).
- Strongly-convex bandits: no loss in regret except log factors. Proved!
- Non strongly-convex bandits: encouraging empirical results for LinUCB + fast GD on two news feed platforms.

125 Strongly convex bandits: fast GD
Loop: from $\theta_n$, pick $i_n$ uniformly in $\{1, \ldots, n\}$ (random sampling) and apply the GD update with step-sizes $\gamma_n$ and the sample gradient:
$$\theta_n = \theta_{n-1} + \gamma_n\big(y_{i_n} - \theta_{n-1}^Tx_{i_n}\big)x_{i_n}.$$
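
A sketch of the fast (online) gradient-descent update above for least-squares regression: sample a past observation uniformly and take an O(d) step along the sample gradient, using the step-size form from the error-bound slide.

```python
import numpy as np

def fast_gd_ls(xs, ys, d, c=1.0, num_iters=100000):
    """O(d)-per-iteration SGD approximation to the least-squares solution over (xs, ys)."""
    theta = np.zeros(d)
    n_obs = len(xs)
    for n in range(1, num_iters + 1):
        gamma_n = c / (4.0 * (c + n))        # step-size gamma_n = c / (4(c+n))
        i = np.random.randint(n_obs)         # uniform random sample of a past observation
        theta += gamma_n * (ys[i] - theta.dot(xs[i])) * xs[i]
    return theta
```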

128 Strongly convex bandits: Assumptions
Setting: $y_n = x_n^T\theta^* + \xi_n$, where $\xi_n$ is i.i.d. zero-mean noise.
(A1) $\sup_n\|x_n\|_2 \le 1$ (bounded features).
(A2) $|\xi_n| \le 1$ for all $n$ (bounded noise).
(A3) $\lambda_{\min}\big(\frac{1}{n}\sum_{i=1}^n x_ix_i^T\big) \ge \mu$ (strongly convex covariance matrix, for each $n$!).

132 Why is deriving error bounds difficult?
$$\theta_n - \hat\theta_n = \theta_n - \hat\theta_{n-1} + \hat\theta_{n-1} - \hat\theta_n = \theta_{n-1} - \hat\theta_{n-1} + \hat\theta_{n-1} - \hat\theta_n + \gamma_n\big(y_{i_n} - \theta_{n-1}^Tx_{i_n}\big)x_{i_n}$$
$$= \underbrace{\Pi_n(\theta_0 - \hat\theta_0)}_{\text{initial error}} + \underbrace{\sum_{k=1}^n\gamma_k\Pi_n\Pi_k^{-1}M_k}_{\text{sampling error}} - \underbrace{\sum_{k=1}^n\Pi_n\Pi_k^{-1}(\hat\theta_k - \hat\theta_{k-1})}_{\text{drift error}}.$$
The initial and sampling errors are present in earlier SGD works and can be handled easily; the drift error is a consequence of the changing target and is hard to control!
Note: $\bar A_n = \frac{1}{n}\sum_{i=1}^n x_ix_i^T$, $\Pi_n := \prod_{k=1}^n(I - \gamma_k\bar A_k)$, and $M_k$ is a martingale difference.

135 Handling the Drift Error
Note $F_n(\theta) := \frac{1}{2}\sum_{i=1}^n(y_i - \theta^Tx_i)^2$ and $\bar A_n = \frac{1}{n}\sum_{i=1}^n x_ix_i^T$; also $E[y_n \mid x_n] = x_n^T\theta^*$.
To control the drift error, we observe that
$$\nabla F_n(\hat\theta_n) = 0 = \nabla F_{n-1}(\hat\theta_{n-1}) \implies \big(\hat\theta_{n-1} - \hat\theta_n\big) = \xi_nA_{n-1}^{-1}x_n - \big(x_n^T(\hat\theta_n - \theta^*)\big)A_{n-1}^{-1}x_n.$$
Thus the drift is controlled by the convergence of $\hat\theta_n$ to $\theta^*$. Key: the confidence ball result.
1 Dani, V., Hayes, T. P., and Kakade, S. M. (2008). "Stochastic Linear Optimization under Bandit Feedback." In: COLT.

138 Strongly convex bandits: Error bound
With $\gamma_n = c/(4(c+n))$ and $\mu c/4 \in (2/3, 1)$ we have:
High-probability bound: for any $\delta > 0$,
$$P\Big(\|\theta_n - \hat\theta_n\|_2 \le \frac{K_{\mu,c}}{\sqrt{n}}\sqrt{\log\frac{1}{\delta}} + \frac{h_1(n)}{n}\Big) \ge 1 - \delta.$$
Bound in expectation (optimal rate $O(n^{-1/2})$):
$$E\|\theta_n - \hat\theta_n\|_2 \le \underbrace{\frac{\|\theta_0 - \hat\theta_n\|_2}{n^{\mu c}}}_{\text{initial error}} + \underbrace{\frac{h_2(n)}{\sqrt{n}}}_{\text{sampling error}}.$$
1 $K_{\mu,c}$ is a constant depending on $\mu$ and $c$, and $h_1(n), h_2(n)$ hide log factors.
2 By iterate averaging, the dependency of $c$ on $\mu$ can be removed.

141 PEGE Algorithm
Input: a basis $\{b_1, \ldots, b_d\} \subset \mathcal{D}$ for $\mathbb{R}^d$. For each cycle $m = 1, 2, \ldots$:
- Exploration phase: for $i = 1$ to $d$, choose arm $b_i$ and observe $y_i(m)$; compute the OLS estimate
$$\hat\theta_{md} = \Big(\frac{1}{d}\sum_{i=1}^d b_ib_i^T\Big)^{-1}\frac{1}{d}\sum_{i=1}^d b_i\,\frac{1}{m}\sum_{j=1}^m y_j(i).$$
- Exploitation phase: find $x^* = \arg\min_{x \in \mathcal{D}}\ \hat\theta_{md}^Tx$ and choose arm $x^*$ $m$ times consecutively.
1 Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly Parameterized Bandits. In: Math. Oper. Res.

145 PEGE Algorithm with fast GD
Input: a basis $\{b_1, \ldots, b_d\} \subset \mathcal{D}$ for $\mathbb{R}^d$. For each cycle $m = 1, 2, \ldots$:
- Exploration phase: for $i = 1$ to $d$, choose arm $b_i$, observe $y_i(m)$, and update the fast GD iterate $\theta_{md}$ (instead of computing OLS).
- Exploitation phase: find $x^* = \arg\min_{x \in \mathcal{D}}\ \theta_{md}^Tx$ and choose arm $x^*$ $m$ times consecutively.
(A sketch in code follows below.)

149 Strongly convex bandits: Regret bound for PEGE + fast GD

(A3) (Strongly convex arms) The function G : \theta \mapsto \arg\min_{x \in D} \theta^\top x is J-Lipschitz.

Theorem. Under (A1)-(A3), the regret R_T := \sum_{i=1}^{T} \big( x_i^\top \theta^* - \min_{x \in D} x^\top \theta^* \big) satisfies
  R_T \le C K_1(n)^2 d^{-1} \big( \|\theta^*\|_2 + \|\theta^*\|_2^{-1} \big) \sqrt{T}.

The bound is worse than that for PEGE by only a factor of O(log^4(n)).

Prashanth L A Convergence rate of TD(0) March 27, / 84

150 Non-strongly convex bandits: Fast LinUCB

Rewards y_n satisfy E[y_n | x_n] = x_n^\top \theta.

Loop: choose x_n := \arg\max_{x \in D} UCB(x), observe y_n, and update the fast GD iterate \theta_n, which estimates the least-squares solution \hat\theta_n.

Fast GD is used to compute UCB(x) := x^\top \theta_n + \alpha \sqrt{x^\top \bar\Phi_n^{-1} x}.

Prashanth L A Convergence rate of TD(0) March 27, / 84
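A minimal sketch of this loop is given below, with the confidence width written in the standard LinUCB form sqrt(x^T Phi^{-1} x). The width form, the single GD step per round, and the step size are assumptions made for illustration, not the exact construction from the paper.

import numpy as np

def fast_linucb(D, T, reward_fn, alpha=1.0, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_arms, d = D.shape
    theta = np.zeros(d)                 # GD iterate standing in for the least-squares estimate
    Phi = np.eye(d)                     # regularized design matrix used for the width term
    data, rewards = [], []
    for n in range(1, T + 1):
        # UCB(x) = x^T theta + alpha * sqrt(x^T Phi^{-1} x), evaluated for every arm
        widths = np.sqrt(np.einsum('ij,jk,ik->i', D, np.linalg.inv(Phi), D))
        x = D[np.argmax(D @ theta + alpha * widths)]
        y = reward_fn(x)
        data.append((x, y))
        rewards.append(y)
        Phi += np.outer(x, x)
        # One random-sampling GD step on the squared loss replaces solving for theta_hat
        xi, yi = data[rng.integers(len(data))]
        theta += (step / np.sqrt(n)) * (yi - xi @ theta) * xi
    return theta, rewards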

153 Non-strongly convex bandits: Adaptive regularization

Problem: in many settings, \lambda_{\min}\Big( \frac{1}{n} \sum_{i=1}^{n-1} x_i x_i^\top \Big) \ge \mu may not hold.

Solution: adaptively regularize with \lambda_n:
  \hat\theta_n := \arg\min_\theta \frac{1}{2n} \sum_{i=1}^{n} (y_i - \theta^\top x_i)^2 + \lambda_n \|\theta\|^2.

Random-sampling GD update: pick i_n uniformly in {1, ..., n} and update \theta_n using (x_{i_n}, y_{i_n}):
  \theta_n = \theta_{n-1} + \gamma_n \big( (y_{i_n} - \theta_{n-1}^\top x_{i_n}) x_{i_n} - \lambda_n \theta_{n-1} \big).

Prashanth L A Convergence rate of TD(0) March 27, / 84
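The random-sampling GD update can be transcribed directly into code. The schedules gamma_n and lambda_n below follow the O(n^{-alpha}) / Omega(n^{-(1-alpha)}) choice discussed next, and the toy data generator is purely illustrative.

import numpy as np

def adaptive_reg_gd_step(theta, data, gamma_n, lambda_n, rng):
    # Pick i_n uniformly in {1, ..., n} and take one regularized SGD step:
    # theta_n = theta_{n-1} + gamma_n * ((y_i - theta^T x_i) x_i - lambda_n * theta_{n-1})
    x_i, y_i = data[rng.integers(len(data))]
    return theta + gamma_n * ((y_i - x_i @ theta) * x_i - lambda_n * theta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, alpha = 5, 0.5
    theta_true = rng.normal(size=d)
    theta, data = np.zeros(d), []
    for n in range(1, 2001):
        x = rng.normal(size=d)
        data.append((x, x @ theta_true + 0.1 * rng.normal()))
        theta = adaptive_reg_gd_step(theta, data,
                                     gamma_n=0.5 * n ** (-alpha),
                                     lambda_n=n ** (-(1 - alpha)),
                                     rng=rng)
    print("error vs. theta_true:", np.linalg.norm(theta - theta_true))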

157 Non-strongly convex bandits: Why is deriving error bounds difficult here?

We need to bound
  \theta_n - \hat\theta_n = \underbrace{\Pi_n (\theta_0 - \hat\theta_0)}_{\text{Initial Error}} - \underbrace{\sum_{k=1}^{n} \Pi_n \Pi_k^{-1} (\hat\theta_k - \hat\theta_{k-1})}_{\text{Drift Error}} + \underbrace{\sum_{k=1}^{n} \gamma_k \Pi_n \Pi_k^{-1} M_k}_{\text{Sampling Error}}.   (3)

Need \sum_{k=1}^{n} \gamma_k \lambda_k \to \infty to bound the initial error.

Set \gamma_n = O(n^{-\alpha}) (forcing \lambda_n = \Omega(n^{-(1-\alpha)})).

Bad news: this choice, when plugged into (3), results in only a constant error bound!

Note: \Pi_n := \prod_{k=1}^{n} \big( I - \gamma_k (\bar A_k + \lambda_k I) \big) and \|\hat\theta_{n-1} - \hat\theta_n\| = \Omega(n^{-1}), whenever \alpha \in (0, 1).

Prashanth L A Convergence rate of TD(0) March 27, / 84
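For completeness, here is a sketch of how decomposition (3) arises, assuming the GD recursion can be written as a linear recursion driven by a noise term M_n, with \hat\theta_0 a convention for the initial target (notation otherwise as on the slide).

% One-step recursion for e_n := \theta_n - \hat\theta_n:
%   e_n = (I - \gamma_n(\bar A_n + \lambda_n I)) e_{n-1} + (\hat\theta_{n-1} - \hat\theta_n) + \gamma_n M_n.
% Unrolling, and using \Pi_n \Pi_k^{-1} = \prod_{j=k+1}^{n} (I - \gamma_j(\bar A_j + \lambda_j I)):
\begin{align*}
  \theta_n - \hat\theta_n
    &= \Pi_n(\theta_0 - \hat\theta_0)
       - \sum_{k=1}^{n} \Pi_n \Pi_k^{-1}\,(\hat\theta_k - \hat\theta_{k-1})
       + \sum_{k=1}^{n} \gamma_k\, \Pi_n \Pi_k^{-1}\, M_k .
\end{align*}

When the \bar A_k are positive semi-definite and the step sizes are small, \|\Pi_n\| is at most of the order of \exp(-\sum_{k} \gamma_k \lambda_k), which is why \sum_k \gamma_k \lambda_k must diverge to control the initial error.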

161 News recommendation application Dilbert's boss on news recommendation (and ML) Prashanth L A Convergence rate of TD(0) March 27, / 84

162 News recommendation application Preliminary Results on Complacs News Feed Platform. Figure: cumulative reward vs. iteration for LinUCB and LinUCB-GD. Prashanth L A Convergence rate of TD(0) March 27, / 84

163 News recommendation application Experiments on the Yahoo! dataset [1]. Figure: the Featured tab in the Yahoo! Today module. [1] Yahoo! User-Click Log Dataset, given under the Webscope program (2011). Prashanth L A Convergence rate of TD(0) March 27, / 84

164 News recommendation application Tracking Error
Figure: tracking error \|\theta_n - \hat\theta_n\| vs. iteration n of fLinUCB-GD (SGD), fLinUCB-SVRG (SVRG [1]) and fLinUCB-SAG (SAG [2]).
[1] Johnson, R. and Zhang, T. (2013), Accelerating stochastic gradient descent using predictive variance reduction. NIPS.
[2] Roux, N. L., Schmidt, M. and Bach, F. (2012), A stochastic gradient method with an exponential convergence rate for finite training sets. arXiv preprint.
Prashanth L A Convergence rate of TD(0) March 27, / 84

165 News recommendation application Runtime Performance on two days of the Yahoo! dataset. Figure: runtime (ms) of LinUCB, fLinUCB-GD, fLinUCB-SVRG and fLinUCB-SAG on Day-2 and Day-4. Prashanth L A Convergence rate of TD(0) March 27, / 84

166 For Further Reading I

Nathaniel Korda and Prashanth L.A., On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. arXiv preprint.
Prashanth L.A., Nathaniel Korda and Rémi Munos, Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. ECML.
Nathaniel Korda, Prashanth L.A. and Rémi Munos, Fast gradient descent for least squares regression: Non-asymptotic bounds and application to bandits. AAAI.

Prashanth L A Convergence rate of TD(0) March 27, / 84

167 For Further Reading Dilbert's boss (again) on big data! Prashanth L A Convergence rate of TD(0) March 27, / 84
