1 "It is hard to predict, especially about the future." - Niels Bohr
"You are what you pretend to be, so be careful what you pretend to be." - Kurt Vonnegut
Prashanth L A, Convergence rate of TD(0), March 27, 2015
2 Convergence rate of TD(0) with function approximation
Prashanth L A
Joint work with Nathaniel Korda and Rémi Munos
Indian Institute of Science | MLRG, Oxford University | Google DeepMind
March 27, 2015
3 Background
4 Markov Decision Processes (MDPs)
MDP: set of states X, set of actions A, rewards r(x, a).
Transition probability: p(s, a, s') = Pr{s_{t+1} = s' | s_t = s, a_t = a}.
5 The Controlled Markov Property
For all i_0, i_1, ..., s, s', b_0, b_1, ..., a:
P(s_{t+1} = s' | s_t = s, a_t = a, ..., s_0 = i_0, a_0 = b_0) = p(s, a, s').
Figure: the controlled Markov behaviour.
6 Value function
V^π(s) = E[ Σ_{t=0}^∞ β^t r(s_t, π(s_t)) | s_0 = s ], with reward r and policy π.
V^π is the fixed point of the Bellman operator T^π:
T^π(V)(s) := r(s, π(s)) + β Σ_{s'} p(s, π(s), s') V(s').
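The fixed-point characterisation above can be checked numerically; the two-state chain below (transition matrix P, rewards r, discount β) is an illustrative assumption, not an example from the talk.

```python
import numpy as np

# Toy 2-state chain under a fixed policy pi: transition matrix P, rewards r.
# All numbers are illustrative; only the fixed-point property matters here.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])
beta = 0.9  # discount factor

def bellman_T(V):
    """Bellman operator: (T^pi V)(s) = r(s, pi(s)) + beta * sum_s' p(s, pi(s), s') V(s')."""
    return r + beta * P @ V

# T^pi is a beta-contraction, so repeated application converges to V^pi.
V = np.zeros(2)
for _ in range(500):
    V = bellman_T(V)

# V^pi also solves the linear system (I - beta P) V = r.
V_exact = np.linalg.solve(np.eye(2) - beta * P, r)
assert np.allclose(V, V_exact, atol=1e-6)        # iteration reached the fixed point
assert np.allclose(bellman_T(V), V, atol=1e-6)   # and T^pi(V^pi) = V^pi
```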
11 Policy evaluation using TD: temporal-difference learning
Problem: estimate the value function of a given policy π. Solution: use TD(0):
V_{t+1}(s_t) = V_t(s_t) + α_t (r_{t+1} + β V_t(s_{t+1}) - V_t(s_t)).
Why TD(0)? It is a simulation-based algorithm, like Monte Carlo (no model necessary!); it updates a guess based on another guess (like DP); and it is guaranteed to converge to the value function V^π(s) under standard assumptions.
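As a sketch of the update above, here is tabular TD(0) on a toy three-state chain; the chain, rewards, and step-size schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state chain under a fixed policy; all numbers are illustrative.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
r = np.array([1.0, 0.0, 0.5])     # reward r(s, pi(s)), taken state-dependent here
beta = 0.8                         # discount factor

V_true = np.linalg.solve(np.eye(3) - beta * P, r)   # exact V^pi for comparison

# TD(0): update a guess based on another guess, along one simulated trajectory.
V = np.zeros(3)
s = 0
for t in range(200_000):
    s_next = rng.choice(3, p=P[s])
    alpha_t = 20.0 / (200.0 + t)                    # Robbins-Monro step sizes
    V[s] += alpha_t * (r[s] + beta * V[s_next] - V[s])
    s = s_next

assert np.max(np.abs(V - V_true)) < 0.2   # close to V^pi after many steps
```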
13 TD with Function Approximation
Linear function approximation: V^π(s) ≈ θ^T φ(s), with parameter θ ∈ R^d and feature φ(s) ∈ R^d. Note: d << |S|.
TD fixed point: Φ θ* = Π T^π(Φ θ*), where Φ is the feature matrix with rows φ(s)^T, s ∈ S, and Π is the orthogonal projection onto B = {Φθ | θ ∈ R^d}.
15 TD(0) with function approximation
θ_{n+1} = θ_n + γ_n (r(s_n, π(s_n)) + β θ_n^T φ(s_{n+1}) - θ_n^T φ(s_n)) φ(s_n),
a fixed-point iteration with step sizes γ_n.
Tsitsiklis and Van Roy (1997) [1] show that θ_n → θ* a.s., where θ* solves A θ* = b, with A = Φ^T Ψ (I - βP) Φ and b = Φ^T Ψ r.
[1] J. N. Tsitsiklis and B. Van Roy (1997), "An analysis of temporal-difference learning with function approximation," IEEE Transactions on Automatic Control.
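The convergence θ_n → θ* with A θ* = b can be illustrated on a small synthetic chain. Everything below (chain, features, step sizes) is an assumed toy instance, constructed so that the true value function is linearly realisable and θ* is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy chain with linearly realisable values, so the TD fixed point theta* is
# known exactly; all numbers are illustrative assumptions.
S, d, beta = 5, 2, 0.7
P = rng.dirichlet(np.ones(S), size=S)               # random transition matrix
Phi = rng.uniform(-1.0, 1.0, size=(S, d))
Phi /= np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1.0)  # ||phi(s)||_2 <= 1

theta_true = np.array([1.0, -0.5])
r = (np.eye(S) - beta * P) @ (Phi @ theta_true)     # makes V^pi = Phi theta_true

# Stationary distribution Psi of P, and the fixed point A theta* = b.
w, v = np.linalg.eig(P.T)
psi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
psi /= psi.sum()
A = Phi.T @ np.diag(psi) @ (np.eye(S) - beta * P) @ Phi
b = Phi.T @ np.diag(psi) @ r
theta_star = np.linalg.solve(A, b)                  # equals theta_true here

# TD(0) with linear function approximation along one simulated trajectory.
theta = np.zeros(d)
s = 0
for n in range(200_000):
    s_next = rng.choice(S, p=P[s])
    gamma_n = 50.0 / (100.0 + n)                    # decaying step sizes
    delta = r[s] + beta * theta @ Phi[s_next] - theta @ Phi[s]
    theta += gamma_n * delta * Phi[s]
    s = s_next

assert np.allclose(theta_star, theta_true, atol=1e-8)
assert np.linalg.norm(theta - theta_star) < 0.2
```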
16 Assumptions
Ergodicity: the Markov chain induced by the policy π is irreducible and aperiodic; moreover, there exists a stationary distribution Ψ (= Ψ^π) for this Markov chain.
Linear independence: the feature matrix Φ has full column rank, and λ_min(Φ^T Ψ Φ) ≥ μ > 0.
Bounded rewards: |r(s, π(s))| ≤ 1 for all s ∈ S.
Bounded features: ||φ(s)||_2 ≤ 1 for all s ∈ S.
17 Assumptions (contd.)
Step sizes satisfy Σ_n γ_n = ∞ and Σ_n γ_n^2 < ∞.
Bounded mixing time: there exists a non-negative function B(·) such that for all s_0 ∈ S and m ≥ 0,
Σ_{τ=0}^∞ ||E(φ(s_τ) | s_0) - E_Ψ(φ(s_τ))|| ≤ B(s_0),
Σ_{τ=0}^∞ ||E[φ(s_τ) φ(s_{τ+m})^T | s_0] - E_Ψ[φ(s_τ) φ(s_{τ+m})^T]|| ≤ B(s_0),
where B(·) satisfies: for any q > 1, there exists a K_q < ∞ such that E[B^q(s) | s_0] ≤ K_q B^q(s_0).
18 Concentration bounds: non-averaged case
"In the long run we are all dead." - John Maynard Keynes
Question: what happens in a short run of TD(0) with function approximation?
19 Concentration Bounds: Non-averaged TD(0)
20 Non-averaged case: bound in expectation
Step-size choice: γ_n = c / (2(c + n)), with (1 - β)^2 μ c > 1/2.
Bound in expectation: E ||θ_n - θ*||_2 ≤ K_1(n) / √(n + c), where
K_1(n) = 2√c ||θ_0 - θ*||_2 / (n + c)^{(1-β)^2 μc - 1/2} + c(1 - β)(3 + 6H) B(s_0) / (2(1 - β)^2 μc - 1),
and H is an upper bound on ||θ_n||_2, for all n.
22 Non-averaged case: high-probability bound
Step-size choice: γ_n = c / (2(c + n)), with (μ(1 - β)/2 + 3B(s_0)) c > 1.
High-probability bound: P( ||θ_n - θ*||_2 ≤ K_2(n) / √(n + c) ) ≥ 1 - δ, where
K_2(n) := (1 - β) c √(ln(1/δ)) (1 + 9B(s_0)^2) / ((μ(1 - β)/2 + 3B(s_0)^2) c - 1) + K_1(n).
Both K_1(n) and K_2(n) above are O(1).
25 Why are these bounds problematic?
Obtaining the optimal rate O(1/√n) with a step size γ_n = c/(c + n):
In expectation: c must be chosen such that (1 - β)^2 μ c ∈ (1/2, ∞).
In high probability: c should satisfy (μ(1 - β)/2 + 3B(s_0)) c > 1.
So the optimal rate requires knowledge of the mixing bound B(s_0). Even in finite state-space settings, B(s_0) is a constant, albeit one that depends on the transition dynamics!
Solution: iterate averaging.
27 Proof Outline
Let z_n = θ_n - θ*. We first bound the deviation of this error from its mean:
P( ||z_n||_2 ≥ E ||z_n||_2 + ɛ ) ≤ exp( -ɛ^2 / (2 Σ_{i=1}^n L_i^2) ), for all ɛ > 0,
and then bound the size of the mean itself:
E ||z_n||_2^2 ≤ exp(-(1 - β) μ Γ_n) ||z_0||_2^2   [initial error]
 + (3 + 6H)^2 B(s_0)^2 Σ_{k=1}^{n-1} γ_{k+1}^2 exp(-2(1 - β) μ (Γ_n - Γ_{k+1}))   [sampling and mixing error],
where Γ_n := Σ_{k=1}^n γ_k. Note that
L_i := γ_i [ Π_{j=i+1}^n (1 - 2γ_j μ (1 - β - γ_j) + [1 + β(3 - β)] B(s_0)^2) ]^{1/2}.
29 Proof Outline: Bound in Expectation
Let f_{X_n}(θ) := [r(s_n, π(s_n)) + β θ^T φ(s_{n+1}) - θ^T φ(s_n)] φ(s_n). Then the TD update is equivalent to
θ_{n+1} = θ_n + γ_n [ E_Ψ(f_{X_n}(θ_n)) + ɛ_n + ΔM_n ]   (1)
Mixing error: ɛ_n := E(f_{X_n}(θ_n) | s_0) - E_Ψ(f_{X_n}(θ_n)).
Martingale-difference sequence: ΔM_n := f_{X_n}(θ_n) - E(f_{X_n}(θ_n) | s_0).
Unrolling (1), we obtain
z_{n+1} = (I - γ_n A) z_n + γ_n (ɛ_n + ΔM_n) = Π_n z_0 + Σ_{k=1}^n γ_k Π_n Π_k^{-1} (ɛ_k + ΔM_k),
where A := Φ^T Ψ (I - βP) Φ and Π_n := Π_{k=1}^n (I - γ_k A).
31 Proof Outline: Bound in Expectation (contd.)
z_{n+1} = (I - γ_n A) z_n + γ_n (ɛ_n + ΔM_n) = Π_n z_0 + Σ_{k=1}^n γ_k Π_n Π_k^{-1} (ɛ_k + ΔM_k).
By Jensen's inequality, we obtain
E(||z_n||_2 | s_0) ≤ ( E(⟨z_n, z_n⟩ | s_0) )^{1/2}
 ≤ ( ||Π_n z_0||_2^2 + 2 Σ_{k=1}^n γ_k^2 ||Π_n Π_k^{-1}||^2 E(||ɛ_k||_2^2 | s_0) + 2 Σ_{k=1}^n γ_k^2 ||Π_n Π_k^{-1}||^2 E(||ΔM_k||_2^2 | s_0) )^{1/2}.
The rest of the proof amounts to bounding each of the terms on the RHS above.
32 Proof Outline: High-Probability Bound
Recall z_n = θ_n - θ*.
Step 1 (error decomposition): ||z_n||_2 - E ||z_n||_2 = Σ_{i=1}^n (g_i - E[g_i | F_{i-1}]) = Σ_{i=1}^n D_i,
where D_i := g_i - E[g_i | F_{i-1}], g_i := E[ ||z_n||_2 | θ_i ], and F_i = σ(θ_1, ..., θ_i).
Step 2 (Lipschitz continuity): the functions g_i are Lipschitz continuous with Lipschitz constants L_i.
Step 3 (concentration inequality):
P( ||z_n||_2 - E ||z_n||_2 ≥ ɛ ) = P( Σ_{i=1}^n D_i ≥ ɛ ) ≤ exp(-λɛ) E[exp(λ Σ_{i=1}^n D_i)] ≤ exp(-λɛ) exp( λ^2 Σ_{i=1}^n L_i^2 / 2 ),
and optimizing over λ > 0 yields the stated bound.
35 Concentration Bounds: Iterate-Averaged TD(0)
36 Polyak-Ruppert averaging: bound in expectation
Bigger step size + averaging: γ_n := (1 - β)^2 (c/(c + n))^α and θ̄_{n+1} := (θ_1 + ... + θ_n)/n, with α ∈ (1/2, 1) and c > 0.
Bound in expectation: E ||θ̄_n - θ*||_2 ≤ K_1^{IA}(n) / (n + c)^{α/2}, where
K_1^{IA}(n) := √(1 + 9B(s_0)^2) ||θ_0 - θ*||_2 / (n + c)^{(1-α)/2} + 2β(1 - β) c^α H B(s_0) / (μ c^α (1 - β)^2)^{(1+2α)/(2(1-α))}.
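Iterate averaging can be sketched on a generic noisy linear fixed-point iteration standing in for the mean TD(0) dynamics; the matrix A, vector b, noise level, and the choice α = 0.75 are illustrative assumptions.

```python
import numpy as np

# Polyak-Ruppert averaging on a noisy linear fixed-point iteration
# theta_{n+1} = theta_n + gamma_n (b - A theta_n + noise), a stand-in for the
# mean TD(0) update; A, b and the noise level are illustrative assumptions.
A = np.array([[1.0, 0.2],
              [0.0, 0.5]])
b = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A, b)

def run(n_steps, alpha, c=10.0, seed=3):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    avg = np.zeros(2)
    for n in range(n_steps):
        gamma_n = (c / (c + n)) ** alpha        # bigger step size, alpha in (1/2, 1)
        noise = rng.normal(scale=0.5, size=2)
        theta = theta + gamma_n * (b - A @ theta + noise)
        avg += (theta - avg) / (n + 1)          # running average of the iterates
    return theta, avg

last_iterate, averaged = run(50_000, alpha=0.75)
# The averaged iterate concentrates around theta* despite the larger steps.
assert np.linalg.norm(averaged - theta_star) < 0.1
```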
38 Iterate averaging: high-probability bound
Bigger step size + averaging: γ_n := (1 - β)^2 (c/(c + n))^α, θ̄_{n+1} := (θ_1 + ... + θ_n)/n.
High-probability bound: P( ||θ̄_n - θ*||_2 ≤ K_2^{IA}(n) / (n + c)^{α/2} ) ≥ 1 - δ, where
K_2^{IA}(n) := √(ln(1/δ)) (1 + 9B(s_0)^2) [ 2^α + 2(3^α)(μ/(1 - β) + B(s_0)) c^α ] / ( α [ μ/(1 - β) - B(s_0)/n^{(1-α)/2} ] ) + K_1^{IA}(n).
Here α can be chosen arbitrarily close to 1, resulting in a rate O(1/√n).
41 Proof Outline
Let θ̄_{n+1} := (θ_1 + ... + θ_n)/n and z̄_n = θ̄_{n+1} - θ*. Then
P( ||z̄_n||_2 ≥ E ||z̄_n||_2 + ɛ ) ≤ exp( -ɛ^2 / (2 Σ_{i=1}^n L_i^2) ), for all ɛ > 0, where
L_i := (γ_i / n) ( 1 + Σ_{l=i+1}^{n-1} Π_{j=i}^{l} (1 - 2γ_j μ (1 - β - γ_j) + [1 + β(3 - β)] B(s_0)) ).
With γ_n = (1 - β)(c/(c + n))^α, the sum Σ_{i=1}^n L_i^2 is of order 1/n^α, which gives the high-probability bound on the previous slide.
43 Proof outline: bound in expectation
To bound the expected error we directly average the errors of the non-averaged iterates:
E ||θ̄_{n+1} - θ*||_2 ≤ (1/n) Σ_{k=1}^n E ||θ_k - θ*||_2,
and then specialise to the choice of step size γ_n = (1 - β)(c/(c + n))^α:
E ||θ̄_{n+1} - θ*||_2 ≤ √(1 + 9B(s_0)^2) ( (1/n) Σ_{k=1}^n exp(-μc(k + c)^{1-α}) ||θ_0 - θ*||_2 + 2βH c^α (1 - β) (μ c^α (1 - β)^2)^{-(1+2α)/(2(1-α))} (n + c)^{-α/2} ).
44 Centered TD (CTD)
45 The Variance Problem
Why does iterate averaging work? In TD(0), each iterate introduces high variance, which must be controlled by the step-size choice. Averaging the iterates reduces the variance of the final estimator, and the reduced variance allows for more exploration within the iterates through larger step sizes.
46 A Control-Variate Solution
Centering is another approach to variance reduction: instead of averaging the iterates, one can use an average to guide the iterates, so that all iterates are informed by their history. Constructing this average in epochs allows a constant step-size choice.
47 Centering: The Idea
Recall that for TD(0),
θ_{n+1} = θ_n + γ_n (r(s_n, π(s_n)) + β θ_n^T φ(s_{n+1}) - θ_n^T φ(s_n)) φ(s_n) = θ_n + γ_n f_n(θ_n),
and that θ_n → θ*, the solution of F(θ) := Π T^π(Φθ) - Φθ = 0.
Centering each iterate: θ_{n+1} = θ_n + γ ( f_n(θ_n) - f_n(θ̄_n) + F(θ̄_n) )   (*)
48 Centering: The Idea
Why does centering help? There are no updates after hitting θ*:
θ_{n+1} = θ_n + γ ( f_n(θ_n) - f_n(θ̄_n) + F(θ̄_n) )   (*)
An average guides the updates, resulting in low variance of the term (*). This allows using a (large) constant step size, with an O(d) update, the same as TD(0). Working with epochs, one needs to store only the averaged iterate θ̄_n and an estimate F̂(θ̄_n).
49 Centering: The Idea
Centered update: θ_{n+1} = θ_n + γ ( f_n(θ_n) - f_n(θ̄_n) + F(θ̄_n) ).
Challenges compared to gradient descent with an accessible cost function: F is unknown and inaccessible in our setting, and to prove convergence bounds one has to cope with the error due to incomplete mixing.
51 CTD: The Algorithm
Simulation: take action π(s_n). Fixed-point iteration: update θ_n using (2).
Centering: at the beginning of each epoch, an iterate θ̄^{(m)} is chosen uniformly at random from the previous epoch, together with the estimate F̂^{(m)}(θ̄^{(m)}).
Epoch run: set θ_{mM} := θ̄^{(m)}, and, for n = mM, ..., (m + 1)M - 1,
θ_{n+1} = θ_n + γ ( f_{X_{i_n}}(θ_n) - f_{X_{i_n}}(θ̄^{(m)}) + F̂^{(m)}(θ̄^{(m)}) ),   (2)
where F̂^{(m)}(θ) := (1/M) Σ_{i=(m-1)M}^{mM} f_{X_i}(θ).
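A minimal sketch of this epoch scheme, in a simplified batch/i.i.d.-sampling setting where the dataset average of the TD update plays the role of F (so the anchor F̂ is exact); the dataset sizes, features, and rewards are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# CTD sketched with i.i.d. sampling from a fixed dataset: the empirical mean
# update plays the role of F; all numbers are illustrative assumptions.
T, d, beta = 2000, 3, 0.8
phi = rng.normal(size=(T, d))
phi /= np.maximum(np.linalg.norm(phi, axis=1, keepdims=True), 1.0)
phi_next = rng.normal(size=(T, d))
phi_next /= np.maximum(np.linalg.norm(phi_next, axis=1, keepdims=True), 1.0)
rewards = phi @ np.array([1.0, -0.5, 0.2])

A_bar = (phi[:, :, None] * (phi - beta * phi_next)[:, None, :]).mean(axis=0)
b_bar = (rewards[:, None] * phi).mean(axis=0)
theta_hat = np.linalg.solve(A_bar, b_bar)       # fixed point CTD should reach

def f(i, theta):
    """Per-sample TD update direction f_{X_i}(theta)."""
    return (rewards[i] + beta * theta @ phi_next[i] - theta @ phi[i]) * phi[i]

gamma, M = 0.05, 2000                           # constant step size, epoch length
theta_bar = np.zeros(d)
for epoch in range(30):
    anchor = b_bar - A_bar @ theta_bar          # F-hat(theta_bar), once per epoch
    theta = theta_bar.copy()
    epoch_iterates = np.empty((M, d))
    for n in range(M):
        i = rng.integers(T)
        theta = theta + gamma * (f(i, theta) - f(i, theta_bar) + anchor)
        epoch_iterates[n] = theta
    theta_bar = epoch_iterates[rng.integers(M)].copy()  # uniform draw from epoch

assert np.linalg.norm(theta_bar - theta_hat) < 0.05
```

Note the variance-reduction effect: near θ̂ the bracketed update is nearly zero samplewise, so the constant step size does not leave a noise floor.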
53 Centering: Results
Epoch length and step-size choice: choose M and γ such that C_1 < 1, where
C_1 := 1/(2μγM((1 - β) - d^2 γ)) + γ d^2 / (2((1 - β) - d^2 γ)).
Error bound:
||Φ(θ̄^{(m)} - θ*)||_Ψ^2 ≤ C_1^m ||Φ(θ̄^{(0)} - θ*)||_Ψ^2 + C_2 H (5γ + 4) Σ_{k=1}^{m-1} C_1^{(m-2)-k} B_{(k-1)M}^{kM}(s_0),
where C_2 = γ/(2M((1 - β) - d^2 γ)) and B_{(k-1)M}^{kM}(s_0) is an upper bound on the partial sums
Σ_{i=(k-1)M}^{kM} ||E(φ(s_i) | s_0) - E_Ψ(φ(s_i))|| and Σ_{i=(k-1)M}^{kM} ||E(φ(s_i) φ(s_{i+l})^T | s_0) - E_Ψ(φ(s_i) φ(s_{i+l})^T)||, for l = 0, 1.
55 Centering: Results (contd.)
The effect of the mixing error: if the Markov chain underlying the policy π satisfies
|P(s_t = s | s_0) - ψ(s)| ≤ C ρ^{t/M},
then
||Φ(θ̄^{(m)} - θ*)||_Ψ^2 ≤ C_1^m ||Φ(θ̄^{(0)} - θ*)||_Ψ^2 + C M C_2 H (5γ + 4) max{C_1, ρ^M}^{(m-1)}.
When the MDP mixes exponentially fast (e.g. finite state-space MDPs) we get the exponential convergence rate (only in the first term); otherwise the decay of the error is dominated by the mixing rate.
58 Proof Outline
Let f̃_{X_{i_n}}(θ_n) := f_{X_{i_n}}(θ_n) - f_{X_{i_n}}(θ̄^{(m)}) + E_Ψ(f_{X_{i_n}}(θ̄^{(m)})).
Step 1 (rewriting the CTD update): θ_{n+1} = θ_n + γ ( f̃_{X_{i_n}}(θ_n) + ɛ_n ), where ɛ_n := E(f_{X_{i_n}}(θ̄^{(m)}) | F_{mM}) - E_Ψ(f_{X_{i_n}}(θ̄^{(m)})).
Step 2 (bounding the variance of the centered updates):
E_Ψ ||f̃_{X_{i_n}}(θ_n)||_2^2 ≤ 2 d^2 ( ||Φ(θ_n - θ*)||_Ψ^2 + ||Φ(θ̄^{(m)} - θ*)||_Ψ^2 ).
60 Proof Outline (contd.)
Step 3 (analysis for a particular epoch):
E_{θ_n} ||θ_{n+1} - θ*||_2^2 ≤ ||θ_n - θ*||_2^2 + γ^2 E_{θ_n} ||ɛ_n||_2^2 + 2γ (θ_n - θ*)^T E_{θ_n}[ f̃_{X_{i_n}}(θ_n) ] + γ^2 E_{θ_n} ||f̃_{X_{i_n}}(θ_n)||_2^2
 ≤ ||θ_n - θ*||_2^2 - 2γ((1 - β) - d^2 γ) ||Φ(θ_n - θ*)||_Ψ^2 + 2γ^2 d^2 ||Φ(θ̄^{(m)} - θ*)||_Ψ^2 + γ^2 E_{θ_n} ||ɛ_n||_2^2.
Summing the above inequality over an epoch, and noting that (θ̄^{(m)} - θ*)^T I (θ̄^{(m)} - θ*) ≤ (1/μ)(θ̄^{(m)} - θ*)^T Φ^T Ψ Φ (θ̄^{(m)} - θ*), we obtain the following by setting θ_0 = θ̄^{(m)}:
2γM((1 - β) - d^2 γ) ||Φ(θ̄^{(m+1)} - θ*)||_Ψ^2 ≤ ( 1/μ + γ^2 M d^2 ) ||Φ(θ̄^{(m)} - θ*)||_Ψ^2 + γ^2 Σ_{i=(m-1)M}^{mM} E_{θ_i} ||ɛ_i||_2^2.
The final step is to unroll (across epochs) the final recursion above to obtain the rate for CTD.
61 TD(0) on a batch
62 Figure: Dilbert's boss on big data (cartoon).
63 LSTD: A Batch Algorithm
Given a dataset D := {(s_i, r_i, s_i'), i = 1, ..., T}, LSTD approximates the TD fixed point by θ̂_T = Ā_T^{-1} b̄_T, with complexity O(d^2 T), where
Ā_T = (1/T) Σ_{i=1}^T φ(s_i)(φ(s_i) - β φ(s_i'))^T and b̄_T = (1/T) Σ_{i=1}^T r_i φ(s_i).
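A direct sketch of batch LSTD under these definitions, on a synthetic dataset in which each transition is represented by its feature vectors (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)

# LSTD on a synthetic dataset D = {(s_i, r_i, s_i')}; transitions are
# represented directly by their feature vectors, an illustrative assumption.
T, d, beta = 1000, 4, 0.9
phi = rng.normal(size=(T, d))        # phi(s_i)
phi_next = rng.normal(size=(T, d))   # phi(s_i')
rewards = rng.uniform(size=T)        # r_i

# A_T = (1/T) sum_i phi(s_i)(phi(s_i) - beta phi(s_i'))^T
# b_T = (1/T) sum_i r_i phi(s_i)
A_T = (phi[:, :, None] * (phi - beta * phi_next)[:, None, :]).mean(axis=0)
b_T = (rewards[:, None] * phi).mean(axis=0)

theta_hat = np.linalg.solve(A_T, b_T)   # O(d^2 T) to form A_T, O(d^3) to solve

# theta_hat satisfies the empirical TD fixed-point equation A_T theta = b_T:
assert np.allclose(A_T @ theta_hat, b_T)
```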
65 Complexity of LSTD [1]
Figure: LSPI, a batch-mode RL algorithm for control (policy evaluation: policy π → Q-value Q^π; then policy improvement).
LSTD complexity: O(d^2 T) using the Sherman-Morrison lemma, or O(d^{2.81}) per solve using the Strassen algorithm, or O(d^{2.38}) using the Coppersmith-Winograd algorithm.
67 Complexity of LSTD [2]
Problem: practical applications involve high-dimensional features (e.g. computer Go: d ≈ 10^6), so solving LSTD is computationally intensive. Related works: GTD [1], GTD2 [2], iLSTD [3].
Solution: use stochastic approximation (SA). Complexity: O(dT), an O(d) reduction in complexity. Theory: the SA variant of LSTD does not impact the overall rate of convergence. Experiments: on a traffic-control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!
[1] Sutton et al. (2009), "A convergent O(n) algorithm for off-policy temporal-difference learning," NIPS.
[2] Sutton et al. (2009), "Fast gradient-descent methods for temporal-difference learning with linear function approximation," ICML.
[3] Geramifard et al. (2007), "iLSTD: Eligibility traces and convergence analysis," NIPS.
69 Fast LSTD using Stochastic Approximation
Random sampling: pick i_n uniformly in {1, ..., T}. SA update: update θ_n using (s_{i_n}, r_{i_n}, s_{i_n}').
Update rule (a fixed-point iteration with step sizes γ_n):
θ_n = θ_{n-1} + γ_n ( r_{i_n} + β θ_{n-1}^T φ(s_{i_n}') - θ_{n-1}^T φ(s_{i_n}) ) φ(s_{i_n}).
Complexity: O(d) per iteration.
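The SA scheme above can be sketched as follows. The dataset is a synthetic assumption, with rewards linear in the features so that θ̂_T is well scaled, and the constant c is chosen so that (1 - β)^2 μ c lands roughly in the range used in the analysis later in the talk.

```python
import numpy as np

rng = np.random.default_rng(6)

# Fast LSTD via stochastic approximation: one uniformly sampled transition and
# an O(d) update per step. The dataset below is an illustrative assumption.
T, d, beta = 1000, 4, 0.5
phi = rng.normal(size=(T, d))
phi /= np.maximum(np.linalg.norm(phi, axis=1, keepdims=True), 1.0)       # (A1)
phi_next = rng.normal(size=(T, d))
phi_next /= np.maximum(np.linalg.norm(phi_next, axis=1, keepdims=True), 1.0)
rewards = phi @ np.array([0.5, -0.3, 0.2, 0.1])                          # (A2)

A_T = (phi[:, :, None] * (phi - beta * phi_next)[:, None, :]).mean(axis=0)
b_T = (rewards[:, None] * phi).mean(axis=0)
theta_hat = np.linalg.solve(A_T, b_T)    # batch LSTD solution, for reference

theta = np.zeros(d)
c = 30.0                                 # so (1-beta)^2 * mu * c is around 1.5
for n in range(300_000):
    i = rng.integers(T)                  # pick i_n uniformly in {1, ..., T}
    gamma_n = (1.0 - beta) * c / (2.0 * (c + n))
    delta = rewards[i] + beta * theta @ phi_next[i] - theta @ phi[i]
    theta += gamma_n * delta * phi[i]    # O(d) per iteration

assert np.linalg.norm(theta - theta_hat) < 0.2
```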
71 Assumptions
Setting: given a dataset D := {(s_i, r_i, s_i'), i = 1, ..., T}.
(A1) ||φ(s_i)||_2 ≤ 1 (bounded features).
(A2) |r_i| ≤ R_max < ∞ (bounded rewards).
(A3) λ_min( (1/T) Σ_{i=1}^T φ(s_i) φ(s_i)^T ) ≥ μ (the covariance matrix has a minimum eigenvalue).
75 Convergence Rate
Step-size choice: γ_n = (1 - β)c / (2(c + n)), with (1 - β)^2 μ c ∈ (1.33, 2).
Bound in expectation: E ||θ_n - θ̂_T||_2 ≤ K_1(n) / √(n + c).
High-probability bound: P( ||θ_n - θ̂_T||_2 ≤ K_2(n) / √(n + c) ) ≥ 1 - δ.
By iterate averaging, the dependency of c on μ can be removed.
79 The Constants
K_1(n) = √c ||θ_0 - θ̂_T||_2 / (n + c)^{((1-β)^2 μc - 1)/2} + (1 - β) c h^2(n) / ((1 - β)^2 μc - 1),
K_2(n) = (1 - β) c √(log(1/δ)) / ((1 - β)^2 μc - 1) + K_1(n),
where h(k) := (1 + R_max + β)^2 max( ||θ_0 - θ̂_T||_2 + ||θ̂_T||_2 ln k, 1 ).
Both K_1(n) and K_2(n) are O(1).
80 Iterate Averaging
Bigger step size + averaging: γ_n := (1 - β)^2 (c/(c + n))^α, θ̄_{n+1} := (θ_1 + ... + θ_n)/n.
Bound in expectation: E ||θ̄_n - θ̂_T||_2 ≤ K_1^{IA}(n) / (n + c)^{α/2}.
High-probability bound: P( ||θ̄_n - θ̂_T||_2 ≤ K_2^{IA}(n) / (n + c)^{α/2} ) ≥ 1 - δ.
The dependency of c on μ is removed, at the cost of (1 - α)/2 in the rate.
84  fast LSTD: The Constants

K_1^IA(n) := C ‖θ_0 − θ̂_T‖_2 / (n+c)^((1−α)/2) + h(n) c^α (1−β) / (µ c^α (1−β)²)^(α/(1+2α)),  and

K_2^IA(n) := [ (log δ⁻¹ / (µ(1−β))) ( 2 + 3α·2^α/(µ c^α (1−β)²) + 2^α/(α−1) ) ] / (n+c)^((1−α)/2) + K_1^IA(n).

As before, both K_1^IA(n) and K_2^IA(n) are O(1).
85  fast LSTD: Performance bounds

True value function v, approximate value function ṽ_n := Φθ_n:

‖v − ṽ_n‖_T ≤ ‖v − Πv‖_T/√(1−β²) + O( √( d/((1−β)²µT) ) ) + O( √( ln(1/δ)/((1−β)²µ²n) ) )

(the three terms are the approximation, estimation and computational errors, respectively)

1. ‖f‖_T := √( (1/T) Σ_{i=1}^T f(s_i)² ), for any function f.
2. Lazaric, A., Ghavamzadeh, M., Munos, R. (2012) Finite-sample analysis of least-squares policy iteration. In: JMLR
86  fast LSTD: Performance bounds

‖v − ṽ_n‖_T ≤ ‖v − Πv‖_T/√(1−β²) + O( √( d/((1−β)²µT) ) ) + O( √( ln(1/δ)/((1−β)²µ²n) ) )

- Approximation and estimation errors: artifacts of function approximation and least-squares methods
- Computational error: consequence of using SA for LSTD
- Setting n = ln(1/δ)T/(dµ), the convergence rate is unaffected!
89  Fast LSPI using SA: LSPI, A Quick Recap

Policy Evaluation: policy π → Q-value Q^π;  Policy Improvement: Q^π → greedy policy π'

Q^π(s, a) = E[ Σ_{t=0}^∞ β^t r(s_t, π(s_t)) | s_0 = s, a_0 = a ]

π'(s) = argmax_{a ∈ A} θ^T φ(s, a)
91  Fast LSPI using SA: Policy Evaluation, LSTDQ and its SA variant

Given a set of samples D := {(s_i, a_i, r_i, s'_i), i = 1, ..., T},
LSTDQ approximates Q^π by θ̂_T = Ā_T⁻¹ b̄_T, where

Ā_T = (1/T) Σ_{i=1}^T φ(s_i, a_i)(φ(s_i, a_i) − βφ(s'_i, π(s'_i)))^T,  and
b̄_T = (1/T) Σ_{i=1}^T r_i φ(s_i, a_i).

Fast LSTDQ using SA:
θ_k = θ_{k−1} + γ_k ( r_{i_k} + βθ_{k−1}^T φ(s'_{i_k}, π(s'_{i_k})) − θ_{k−1}^T φ(s_{i_k}, a_{i_k}) ) φ(s_{i_k}, a_{i_k})
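The fast LSTDQ update lends itself to a short sketch; the helper name `fast_lstdq`, the array-based interface, and the step-size choice γ_k = c/(c+k) are illustrative assumptions here:

```python
import numpy as np

def fast_lstdq(phi_sa, phi_next, rewards, beta=0.95, c=1.0, n_iters=5000, seed=0):
    """SA sketch of fast LSTDQ: at each step draw a random sample from D
    and move theta along the TD error.  phi_sa[i] = phi(s_i, a_i) and
    phi_next[i] = phi(s'_i, pi(s'_i)) are precomputed feature rows; the
    schedule gamma_k = c / (c + k) is one illustrative choice."""
    rng = np.random.default_rng(seed)
    T, d = phi_sa.shape
    theta = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(T)                                    # i_k ~ U({1,...,T})
        td_err = rewards[i] + beta * theta @ phi_next[i] - theta @ phi_sa[i]
        theta += (c / (c + k)) * td_err * phi_sa[i]
    return theta
```

Each iteration touches a single sample and costs O(d), in contrast to forming and inverting Ā_T.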
93  Fast LSPI using SA (flspi-SA)

Input: sample set D := {(s_i, a_i, r_i, s'_i)}_{i=1}^T
repeat
  Policy Evaluation:
    for k = 1 to τ:
      - get a random sample index i_k ~ U({1, ..., T})
      - update the flstd-SA iterate θ_k
    θ' ← θ_τ,  Δ = ‖θ − θ'‖_2
  Policy Improvement:
    obtain the greedy policy π'(s) = argmax_{a ∈ A} θ'^T φ(s, a)
  θ ← θ', π ← π'
until Δ < ε
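A minimal Python rendering of the flspi-SA loop, under the assumption that D is a list of (s, a, r, s') tuples and `phi` is a feature map; the names, the harmonic step size, and the tiny interface are all illustrative:

```python
import numpy as np

def flspi_sa(D, phi, actions, beta=0.5, tau=2000, eps=1e-3, max_rounds=20, seed=0):
    """Sketch of flspi-SA: alternate flstd-SA policy evaluation over the
    fixed sample set D with greedy policy improvement, stopping when the
    parameter change Delta = ||theta' - theta||_2 falls below eps."""
    rng = np.random.default_rng(seed)
    d = phi(D[0][0], D[0][1]).shape[0]
    theta_old = np.zeros(d)
    for _ in range(max_rounds):
        theta = theta_old.copy()
        for k in range(1, tau + 1):                 # policy evaluation (flstd-SA)
            s, a, r, s_next = D[rng.integers(len(D))]
            # next action is greedy w.r.t. the current policy's parameter
            a_next = max(actions, key=lambda b: theta_old @ phi(s_next, b))
            td = r + beta * theta @ phi(s_next, a_next) - theta @ phi(s, a)
            theta += (1.0 / (1.0 + k)) * td * phi(s, a)
        delta = np.linalg.norm(theta - theta_old)
        theta_old = theta                           # improvement: greedy w.r.t. theta
        if delta < eps:
            break
    return theta_old
```

On a one-state, one-action toy problem with r = 1 and β = 0.5, the fixed point is 1/(1−β) = 2, which the loop recovers.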
95  Experiments - Traffic Signal Control: The traffic control problem
96  Simulation Results on the 7x9-grid network

[Figure: tracking error ‖θ_k − θ̂_T‖_2 vs. step k of flstd-SA, and throughput (TAR) vs. time steps for LSPI and flspi-SA]
97  Runtime Performance on three road networks

[Figure: runtime (ms) of LSPI vs. flspi-SA on the 7x9-Grid (d = 504), 14x9-Grid (d = 1008) and 14x18-Grid (d = 2016) networks]
98  SGD in Linear Bandits
99  Complacs News Recommendation Platform¹

- NOAM database: 17 million articles from 2010
- Task: find the best among 2000 news feeds
- Reward: relevancy score of the article
- Feature dimension: ≈ 10^5

1. In collaboration with Nello Cristianini and Tom Welfare at University of Bristol
103  More on the relevancy score

Problem: find the best news feed for crime stories. Sample scores:

- "Five dead in Finnish mall shooting": score 1.93
- "Holidays provide more opportunities to drink": score 0.48
- "Russia raises price of vodka": score 2.67
- "Why Obama Care Must Be Defeated": score 0.43
- "University closure due to weather": score 1.06
108  A linear bandit algorithm

Loop: choose x_n → observe y_n → estimate UCBs, where
x_n := argmax_{x ∈ D} UCB(x), and rewards y_n satisfy E[y_n | x_n] = x_n^T θ*.

Regression is used to compute UCB(x) := x^T θ̂_n + α √(x^T A_n⁻¹ x)
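One round of this choose/observe/estimate loop can be sketched as follows; `linucb_step`, `linucb_update` and the ridge initialisation of A are illustrative names and choices, not the exact algorithm analysed later:

```python
import numpy as np

def linucb_step(arms, A, b, alpha=1.0):
    """One round of the UCB rule on the slide: pick
        x_n = argmax_x  x^T theta_hat + alpha * sqrt(x^T A^{-1} x),
    where theta_hat = A^{-1} b is the regression estimate.  alpha is an
    exploration weight; A should include a small ridge so it is invertible."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))  # x^T A^{-1} x per arm
    return int(np.argmax(arms @ theta_hat + alpha * widths)), theta_hat

def linucb_update(A, b, x, y):
    """Rank-1 update of the regression statistics after observing reward y for arm x."""
    return A + np.outer(x, x), b + y * x
```

In a noiseless simulation with two orthogonal arms, the rule quickly concentrates its pulls on the better arm while the estimate converges to θ*.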
112  UCB values

UCB(x) = µ̂(x) + α σ̂(x)   (mean-reward estimate + confidence width)

At each round t, select a tap. Optimize the quality of the n selected beers.
115  UCB values

- Linearity: no need to estimate the mean reward of every arm; estimating θ* is enough
- Regression: θ̄_n = Ā_n⁻¹ b̄_n,  UCB(x) = µ̄(x) + α σ̄(x)
- Confidence width: the Mahalanobis distance of x, √(x^T Ā_n⁻¹ x)
- Optimize the beer you drink, before you get drunk
118  Performance measure

Best arm: x* = argmin_{x ∈ D} x^T θ*.
Regret: R_T = Σ_{i=1}^T (x_i − x*)^T θ*.

Goal: ensure R_T grows sub-linearly with T. Linear bandit algorithms ensure sub-linear regret!
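In a simulation where θ* is known, the cumulative regret of a play sequence is straightforward to compute; this small helper (name ours) follows the slide's loss convention, where the best arm minimises x^T θ*:

```python
import numpy as np

def cumulative_regret(chosen, arms, theta_star):
    """Regret of a play sequence against the best fixed arm: with losses,
    the best arm minimises x^T theta_star and
        R_T = sum_i (x_i - x_star)^T theta_star >= 0."""
    best = arms[np.argmin(arms @ theta_star)]
    return float(np.sum((chosen - best) @ theta_star))
```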
120  Complexity of Least Squares Regression

Typical ML algorithm using regression: choose x_n → observe y_n → estimate θ̂_n

Regression complexity: O(d²) using the Sherman-Morrison lemma, or O(d^2.807) using the Strassen algorithm, or O(d^2.376) using the Coppersmith-Winograd algorithm.

Problem: the Complacs news feed platform has high-dimensional features (d ≈ 10^5), so solving OLS is computationally costly.
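The O(d²) figure comes from updating A_n⁻¹ incrementally rather than re-inverting it each round. A sketch of the rank-1 Sherman-Morrison update (assuming A is symmetric, as a covariance matrix is):

```python
import numpy as np

def sherman_morrison_update(A_inv, x):
    """O(d^2) update of A^{-1} after the rank-1 change A <- A + x x^T:
        (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x),
    valid for symmetric A.  This is what keeps per-round regression at O(d^2)
    instead of the O(d^3) of a fresh matrix inversion."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
```

Even so, O(d²) per round is prohibitive at d ≈ 10^5, which motivates the O(d) gradient-descent alternative on the next slide.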
122  Fast GD for Regression

Loop: pick i_n uniformly in {1, ..., n} (random sampling) → update θ_n using (x_{i_n}, y_{i_n}) (GD update)

Solution: use fast (online) gradient descent (GD)
- Efficient, with a complexity of only O(d)  (well known)
- High-probability bounds with explicit constants can be derived  (not fully known)
123  Bandits + GD for News Recommendation

- LinUCB: a well-known contextual bandit algorithm that employs regression in each iteration
- Fast GD: provides a good approximation to regression (with low computational cost)
- Strongly-convex bandits: no loss in regret except log factors. Proved!
- Non-strongly-convex bandits: encouraging empirical results for LinUCB + fast GD on two news feed platforms
125  Strongly convex bandits: fast GD

Loop: pick i_n uniformly in {1, ..., n} (random sampling) → update θ_n using (x_{i_n}, y_{i_n}) (GD update)

θ_n = θ_{n−1} + γ_n ( y_{i_n} − θ_{n−1}^T x_{i_n} ) x_{i_n}

with step-sizes γ_n; the last factor is the sample gradient.
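The O(d)-per-step update can be sketched directly from the formula above; the dataset interface, the helper name, and the constant c in the step size γ_n = c/(4(c+n)) are illustrative:

```python
import numpy as np

def fast_gd_regression(X, y, c=4.0, n_iters=20000, seed=0):
    """Sketch of online GD for least squares: draw i_n uniformly from the
    available samples and take
        theta_n = theta_{n-1} + gamma_n (y_{i_n} - theta_{n-1}^T x_{i_n}) x_{i_n}
    with gamma_n = c / (4 (c + n)) as on the later error-bound slide.
    Each step costs O(d): one inner product and one scaled vector add."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)
        gamma = c / (4.0 * (c + k))
        theta += gamma * (y[i] - theta @ X[i]) * X[i]
    return theta
```

On a noiseless, well-conditioned design, the iterate approaches the least-squares solution.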
128  Strongly convex bandits: Assumptions

Setting: y_n = x_n^T θ* + ξ_n, where ξ_n is i.i.d. zero-mean noise.

(A1) sup_n ‖x_n‖_2 ≤ 1   (bounded features)
(A2) |ξ_n| ≤ 1, ∀n   (bounded noise)
(A3) λ_min( (1/n) Σ_{i=1}^n x_i x_i^T ) ≥ µ   (strongly convex covariance matrix, for each n!)
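Assumption (A3) can be checked numerically for a given design matrix; a small helper (the function name is ours):

```python
import numpy as np

def min_covariance_eigenvalue(X):
    """Smallest eigenvalue of the empirical covariance (1/n) sum_i x_i x_i^T.
    Assumption (A3) requires this to stay above some mu > 0 for every n;
    a rank-deficient design (all arms in a subspace) violates it."""
    n = X.shape[0]
    cov = X.T @ X / n
    return float(np.linalg.eigvalsh(cov).min())
```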
132  Strongly convex bandits: Why is deriving error bounds difficult?

θ_n − θ̂_n = θ_n − θ̂_{n−1} + θ̂_{n−1} − θ̂_n
          = θ_{n−1} − θ̂_{n−1} + θ̂_{n−1} − θ̂_n + γ_n (y_{i_n} − θ_{n−1}^T x_{i_n}) x_{i_n}
          = Π_n(θ_0 − θ̂_0) + Σ_{k=1}^n γ_k Π_n Π_k⁻¹ M_k − Σ_{k=1}^n Π_n Π_k⁻¹ (θ̂_k − θ̂_{k−1})
            (initial error)   (sampling error)               (drift error)

- Initial and sampling errors: present in earlier SGD works and can be handled easily
- Drift error: a consequence of the changing target. Hard to control!

Note: Ā_n = (1/n) Σ_{i=1}^n x_i x_i^T,  Π_n := Π_{k=1}^n (I − γ_k Ā_k), and M_k is a martingale difference.
135  Strongly convex bandits: Handling the Drift Error

Note F_n(θ) := (1/2) Σ_{i=1}^n (y_i − θ^T x_i)², Ā_n = (1/n) Σ_{i=1}^n x_i x_i^T, and E[y_n | x_n] = x_n^T θ*.

To control the drift error, we observe that

∇F_n(θ̂_n) = 0 = ∇F_{n−1}(θ̂_{n−1})  ⟹  θ̂_{n−1} − θ̂_n = ξ_n A_{n−1}⁻¹ x_n − (x_n^T(θ̂_n − θ*)) A_{n−1}⁻¹ x_n.

Thus, the drift is controlled by the convergence of θ̂_n to θ*. Key: a confidence-ball result¹

1. Dani, Varsha, Thomas P. Hayes, and Sham M. Kakade, (2008) "Stochastic Linear Optimization under Bandit Feedback." In: COLT
138  Strongly convex bandits: Error bound

With γ_n = c/(4(c+n)) and µc/4 ∈ (2/3, 1) we have:

High-probability bound: for any δ > 0,
P( ‖θ_n − θ̂_n‖_2 ≤ K_{µ,c} √(log(1/δ)/n) + h_1(n)/n ) ≥ 1 − δ.

Bound in expectation (optimal rate O(n^{−1/2})):
E‖θ_n − θ̂_n‖_2 ≤ ‖θ_0 − θ̂_n‖_2 / n^{µc}  +  h_2(n)/√n
                 (initial error)            (sampling error)

1. K_{µ,c} is a constant depending on µ and c, and h_1(n), h_2(n) hide log factors.
2. By iterate averaging, the dependence of c on µ can be removed.
141  Strongly convex bandits: The PEGE Algorithm¹

Input: a basis {b_1, ..., b_d} ⊂ D for R^d.

For each cycle m = 1, 2, ... do
  Exploration phase: for i = 1 to d, choose arm b_i and observe y_i(m)
  (pull each of the d basis arms once); using the losses, compute the OLS estimate
    θ̂_{md} = ( Σ_{i=1}^d b_i b_i^T )⁻¹ Σ_{i=1}^d b_i ( (1/m) Σ_{j=1}^m y_j(i) ).
  Exploitation phase: use the OLS estimate to compute the greedy decision
  x* = argmin_{x ∈ D} θ̂_{md}^T x, and choose arm x* m times consecutively.

1. P. Rusmevichientong and J.N. Tsitsiklis, (2010) Linearly Parameterized Bandits. In: Math. Oper. Res.
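The PEGE loop admits a compact sketch under a `pull` oracle returning observed losses; the interface and names are illustrative, and the OLS step uses the fact that after m cycles each basis arm has been pulled exactly m times:

```python
import numpy as np

def pege(basis, arms, pull, n_cycles=10):
    """Sketch of the PEGE loop: in cycle m, pull each of the d basis arms once
    (exploration), refresh the least-squares estimate from all exploration
    pulls so far, then pull the greedy (loss-minimising) arm m times
    (exploitation).  pull(x) returns the observed loss of arm x."""
    d = basis.shape[1]
    G = basis.T @ basis                     # design matrix of the basis arms
    b = np.zeros(d)
    plays = []
    for m in range(1, n_cycles + 1):
        for i in range(d):                  # exploration phase
            b += pull(basis[i]) * basis[i]
        theta_hat = np.linalg.solve(m * G, b)   # OLS over all exploration pulls
        greedy = arms[np.argmin(arms @ theta_hat)]
        plays.extend([greedy] * m)          # exploitation phase
    return theta_hat, np.array(plays)
```

With noiseless pulls and an orthonormal basis, the estimate is exact after the first cycle, so all exploitation pulls go to the true best arm.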
145  Strongly convex bandits: The PEGE Algorithm with fast GD

Input: a basis {b_1, ..., b_d} ⊂ D for R^d.

For each cycle m = 1, 2, ... do
  Exploration phase: for i = 1 to d, choose arm b_i and observe y_i(m)
  (pull each of the d basis arms once); using the losses, update the fast GD iterate θ_{md}.
  Exploitation phase: use the fast GD iterate to compute the greedy decision
  x* = argmin_{x ∈ D} θ_{md}^T x, and choose arm x* m times consecutively.
149  Strongly convex bandits: Regret bound for PEGE + fast GD

(Strongly convex arms): (A3) the function G : θ ↦ argmin_{x ∈ D} θ^T x is J-Lipschitz.

Theorem. Under (A1)-(A3), the regret R_T := Σ_{i=1}^T x_i^T θ* − T min_{x ∈ D} x^T θ* satisfies

R_T ≤ C K_1(n)² d ( ‖θ*‖_2 + ‖θ*‖_2⁻¹ ) √T.

The bound is worse than that for PEGE by only a factor of O(log⁴(n)).
150  Non-strongly convex bandits: Fast LinUCB

Loop: choose x_n → observe y_n → use θ_n to estimate θ̂_n, where
x_n := argmax_{x ∈ D} UCB(x), and rewards y_n satisfy E[y_n | x_n] = x_n^T θ*.

Fast GD is used to compute UCB(x) := x^T θ_n + α √(x^T φ̄_n(x))
153  Non-strongly convex bandits: Adaptive regularization

Problem: in many settings, λ_min( (1/n) Σ_{i=1}^n x_i x_i^T ) ≥ µ may not hold.

Solution: adaptively regularize with λ_n:
θ̄_n := argmin_θ (1/(2n)) Σ_{i=1}^n (y_i − θ^T x_i)² + λ_n ‖θ‖²

GD update (with i_n picked uniformly in {1, ..., n}):
θ_n = θ_{n−1} + γ_n ( (y_{i_n} − θ_{n−1}^T x_{i_n}) x_{i_n} − λ_n θ_{n−1} )
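The regularised update can be sketched as below; the schedules γ_n = n^(−3/4) and λ_n = n^(−1/4) (i.e. α = 3/4) are illustrative placeholders consistent with γ_n = O(n^−α) and λ_n = Ω(n^−(1−α)), not the paper's tuned choices:

```python
import numpy as np

def regularized_fast_gd(X, y, n_iters=20000, seed=0,
                        gamma_fn=lambda n: n ** -0.75,   # gamma_n = O(n^-alpha), alpha=3/4
                        lam_fn=lambda n: n ** -0.25):    # lambda_n = Omega(n^-(1-alpha))
    """Sketch of the adaptively regularised GD update
        theta_n = theta_{n-1} + gamma_n ((y_{i_n} - theta_{n-1}^T x_{i_n}) x_{i_n}
                                          - lambda_n theta_{n-1}),
    used when the covariance matrix need not satisfy the strong-convexity
    assumption (A3)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)
        grad = (y[i] - theta @ X[i]) * X[i] - lam_fn(k) * theta
        theta += gamma_fn(k) * grad
    return theta
```

On a rank-deficient design (where (A3) fails) the iterate still stabilises; the decaying λ_n shrinks the bias it introduces.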
157  Non-strongly convex bandits: Why is deriving error bounds really difficult here?

θ_n − θ̄_n = Π_n(θ_0 − θ̄_0) − Σ_{k=1}^n Π_n Π_k⁻¹ (θ̄_k − θ̄_{k−1}) + Σ_{k=1}^n γ_k Π_n Π_k⁻¹ M_k   (3)
             (initial error)   (drift error)                         (sampling error)

- Need Σ_{k=1}^n γ_k λ_k → ∞ to bound the initial error
- Set γ_n = O(n^−α)  (forcing λ_n = Ω(n^−(1−α)))
- Bad news: this choice, when plugged into (3), results in only a constant error bound!

Note: Π_n := Π_{k=1}^n ( I − γ_k(Ā_k + λ_k I) ) and ‖θ̄_{n−1} − θ̄_n‖ = Ω(n⁻¹), whenever α ∈ (0, 1).
161  News recommendation application: Dilbert's boss on news recommendation (and ML)  [comic strip]
162  Preliminary Results on the Complacs News Feed Platform

[Figure: cumulative reward vs. iteration for LinUCB and LinUCB-GD]
163  Experiments on the Yahoo! Dataset¹

Figure: The "Featured" tab in the Yahoo! Today module

1. Yahoo! User-Click Log Dataset, provided under the Webscope program (2011)
164  Tracking Error

[Figure: tracking error ‖θ̄_n − θ_n‖ vs. iteration n of flinucb-GD (SGD), flinucb-SVRG¹ and flinucb-SAG²]

1. Johnson, R., and Zhang, T. (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: NIPS
2. Roux, N. L., Schmidt, M. and Bach, F. (2012) A stochastic gradient method with an exponential convergence rate for finite training sets. arXiv preprint
165  Runtime Performance on two days of the Yahoo! dataset

[Figure: runtime (ms) on Day-2 and Day-4 for LinUCB, flinucb-GD, flinucb-SVRG and flinucb-SAG]
166 For Further Reading For Further Reading I Nathaniel Korda and Prashanth L.A. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. arXiv preprint. Prashanth L.A., Nathaniel Korda and Rémi Munos. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. ECML. Nathaniel Korda, Prashanth L.A. and Rémi Munos. Fast gradient descent for least squares regression: Non-asymptotic bounds and application to bandits. AAAI.
167 For Further Reading Dilbert's boss (again) on big data!