TD Learning. Sean Meyn. Department of Electrical and Computer Engineering, University of Illinois & the Coordinated Science Laboratory
1 TD Learning
Sean Meyn, Department of Electrical and Computer Engineering, University of Illinois & the Coordinated Science Laboratory
Joint work with I.-K. Cho (Illinois), S. Mannor (McGill), V. Tadic (Sheffield), S. Henderson (Cornell)
NSF support: ECS and DARPA ITMANET
2 Recollections [Slide art: A. Rybko, 2006]
3 Recollections: Value Function Solidarity
Relative value function: h(x) = \int_0^\infty E[c(Q(t;x)) - \eta] \, dt, where \eta = \int c(x) \, \pi(dx)
Fluid value function: J(x) = \int_0^\infty c(q(t;x)) \, dt
Solidarity: \lim_{|x| \to \infty} J(x)/h(x) = 1
[Slide art: A. Rybko, 2006]
4 Recollections: Lyapunov Functions and Policy Translation
Relative value function: h(x) = \int_0^\infty E[c(Q(t;x)) - \eta] \, dt, where \eta = \int c(x) \, \pi(dx)
Fluid value function: J(x) = \int_0^\infty c(q(t;x)) \, dt, with \lim_{|x| \to \infty} J(x)/h(x) = 1
Lyapunov drift: PV(x) := E[V(Q(k+1)) \mid Q(k) = x] \le V(x) - \varepsilon |x| + \eta
[Figure: two-station network with arrival rate \alpha_1 and service rates \mu_1, \mu_2; level sets of the value function in the (x_1, x_2) plane. Slide art: A. Rybko, 2006]
5 Outline
- Value function approximation
- Performance evaluation
- Performance improvement
- Simulation
- TD learning
- Numerics
- Conclusions
[Figure: workload space (w_1, w_2) with infeasible region R and level set {h(x) = h_0}; plots for \kappa = 1 and \kappa = 4]
6 Outline (repeat of slide 5)
7 Average cost
Ergodic Markov chain {X(t) : t = 0, 1, 2, ...} on a countable state space X
Transition matrix P(x, y); invariant probability measure \pi, \pi P = \pi
Cost function c : X \to R_+ (near monotone)
Finite average cost: \eta = \pi(c) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} c(X(t))
8 Value functions
Discounted: h(x) := E\big[ \sum_{t=0}^{\infty} (1+\gamma)^{-(t+1)} c(X(t)) \big], X(0) = x
Relative: h(x) := E\big[ \sum_{t=0}^{\tau} (c(X(t)) - \eta) \big], X(0) = x
9 Value functions
Discounted: h(x) := E\big[ \sum_{t=0}^{\infty} (1+\gamma)^{-(t+1)} c(X(t)) \big], X(0) = x
Relative: h(x) := E\big[ \sum_{t=0}^{\tau} (c(X(t)) - \eta) \big], X(0) = x
Dynamic programming equations:
Discounted: [\gamma I - D] h = c, with D = P - I
Relative: D h = -c + \eta
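As a quick sanity check (not from the slides), the discounted dynamic programming equation can be verified numerically: on a small illustrative chain, the solution of [\gamma I - D] h = c coincides with the series definition h = \sum_t (1+\gamma)^{-(t+1)} P^t c. The chain and cost below are assumptions chosen only for illustration.

```python
import numpy as np

# A small ergodic Markov chain (3 states) -- illustrative, not from the talk.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
c = np.array([1.0, 2.0, 4.0])   # cost function
gamma = 0.1
I = np.eye(3)
D = P - I

# Discounted value function from the DP equation [gamma*I - D] h = c
h_dp = np.linalg.solve(gamma * I - D, c)

# Same function from the series h(x) = sum_t (1+gamma)^{-(t+1)} (P^t c)(x)
h_series = np.zeros(3)
Pt_c = c.copy()
for t in range(2000):
    h_series += (1 + gamma) ** (-(t + 1)) * Pt_c
    Pt_c = P @ Pt_c

print(np.max(np.abs(h_dp - h_series)))  # essentially zero: the two definitions agree
```

Note that [\gamma I - D] = (1+\gamma)I - P, whose inverse is exactly the resolvent R_\gamma that appears later in the talk.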
10 Value function approximation
Discounted-cost value function: h(x) = E\big[ \sum_{t=0}^{\infty} (1+\gamma)^{-(t+1)} c(X(t)) \big]
Network model with c(x) a norm on X
Under general conditions, approximated by the scaled cost: \lim_{r \to \infty} \frac{1}{r} h(rx) = \frac{1}{\gamma} c(x)
11 Value function approximation
Relative value function: h(x) = E\big[ \sum_{t=0}^{\tau} (c(X(t)) - \eta) \big]
h solves Poisson's equation: D h = -c + \eta
12 Value function approximation
Value function approximated by the fluid model:
Relative value function: h(x) = E\big[ \sum_{t=0}^{\tau} (c(X(t)) - \eta) \big]
Fluid value function: J(x) = \int_0^\infty c(q(t)) \, dt
Under general conditions, \lim_{r \to \infty} \frac{1}{r^2} h(rx) = J(x)
13 (cont.) [Figure: workload model; level sets {h(x) = h_0} of the value function, with infeasible region R and idleness \iota_2(k) > 0]
14 Value function approximation: Simulation
Standard Monte Carlo: \eta_k = k^{-1} \sum_{t=0}^{k-1} c(X(t))
CLT variance: \sigma^2 = \pi(h^2 - (Ph)^2), where h solves Poisson's equation D h = -c + \eta
15 (cont.) Example, simulating the single queue: CLT variance = O\big( 1/(1-\rho)^4 \big)
16 Value function approximation: Simulation
\eta_n = n^{-1} \sum_{t=0}^{n-1} Q(t), \qquad \lim_{n \to \infty} \frac{1}{n} \log P\{\eta_n \ge r\} = 0, \quad r > \eta
Example, simulating the single queue: CLT variance = O\big( 1/(1-\rho)^4 \big)
17 Value function approximation: Simulation
Control variates: g : X \to R with known mean \pi(g) = 0
Smoothed estimator: \eta_k(\vartheta) = k^{-1} \sum_{t=0}^{k-1} \big( c(X(t)) - \vartheta g(X(t)) \big), \quad k \ge 1
Perfect estimator: g = h - Ph = -Dh = c - \eta gives zero CLT variance
18 Steady-state customer population
Smoothed estimator using the fluid value function: g = h - Ph = -Dh, with h = J
[Figure: sample-average estimates, standard estimator vs. smoothed estimator]
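A hedged sketch of this smoothed estimator on a single-queue example (my construction, not the network from the slide): a reflected random walk with c(x) = x, load \rho = \alpha/\mu, and the fluid value function J(x) = x^2 / (2(\mu - \alpha)) supplying the control variate g = -DJ = J - PJ. All numerical values are illustrative assumptions.

```python
import numpy as np

# Reflected random walk model of a single queue (illustrative sketch).
# Arrivals w.p. alpha, services w.p. mu = 1 - alpha; cost c(x) = x.
rng = np.random.default_rng(0)
alpha, mu = 0.3, 0.7
rho = alpha / mu                   # steady-state mean is rho/(1-rho) = 0.75

def J(x):
    # Fluid value function for c(x) = x: J(x) = x^2 / (2(mu - alpha))
    return x * x / (2.0 * (mu - alpha))

def PJ(x):
    # One-step expectation of J from state x
    if x == 0:
        return alpha * J(1) + (1 - alpha) * J(0)
    return alpha * J(x + 1) + mu * J(x - 1)

n = 200_000
Q = 0
std_samples = np.empty(n)
sm_samples = np.empty(n)
for t in range(n):
    g = J(Q) - PJ(Q)               # control variate g = -DJ, with pi(g) = 0
    std_samples[t] = Q             # standard Monte Carlo summand c(Q(t))
    sm_samples[t] = Q - g          # smoothed summand c(Q(t)) - g(Q(t))
    U = rng.random()
    if Q == 0:
        Q += 1 if U < alpha else 0
    else:
        Q += 1 if U < alpha else -1

eta_std, eta_sm = std_samples.mean(), sm_samples.mean()
print(eta_std, eta_sm)             # both near 0.75; the smoothed summand has far lower variance
```

Away from the boundary, g(x) = x - 1/(2(\mu-\alpha)) here, so the smoothed summand is nearly constant, which is exactly the variance-reduction mechanism the slide illustrates.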
19 Outline (repeat of slide 5)
20 Value function approximation
Linear parametrization: h^\theta(x) := \theta^T \psi(x) = \sum_{i=1}^{\ell_h} \theta_i \psi_i(x), \quad x \in X
How to find the best parameter \theta? How to define "best"?
21 Hilbert space setting: minimize the mean-square error \|h^\theta - h\|_\pi^2
22 (cont.)
\|h^\theta - h\|_\pi^2 = E_\pi[(h^\theta(X(t)) - h(X(t)))^2]
\nabla_\theta \|h^\theta - h\|_\pi^2 = 2\, E_\pi[(h^\theta(X(t)) - h(X(t)))\, \psi(X(t))]
23 (cont.) Setting the gradient to zero gives the optimal parameter:
\theta^* = \Sigma^{-1} b, \qquad b = E_\pi[h(X)\psi(X)], \quad \Sigma = E_\pi[\psi(X)\psi(X)^T]
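A numerical sketch of the optimal-parameter formula (my illustration, not from the slides): on a small chain where \pi and h can be computed exactly, \theta^* = \Sigma^{-1} b is the \pi-weighted least-squares projection of h onto the span of the basis. The chain, cost, and two-element basis below are assumptions.

```python
import numpy as np

# Illustrative 3-state chain: compute theta* = Sigma^{-1} b exactly.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
c = np.array([1.0, 2.0, 4.0])
gamma = 0.1

# Invariant measure pi (left eigenvector of P for eigenvalue 1)
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

# Discounted value function h = R_gamma c, from [(1+gamma)I - P] h = c
h = np.linalg.solve((1 + gamma) * np.eye(3) - P, c)

# Assumed basis: psi(x) = (1, c(x)) -- two features
psi = np.stack([np.ones(3), c], axis=1)          # shape (3, 2)

Sigma = (psi * pi[:, None]).T @ psi              # E_pi[psi psi^T]
b = (psi * pi[:, None]).T @ h                    # E_pi[h psi]
theta_star = np.linalg.solve(Sigma, b)
h_theta = psi @ theta_star                       # best approximation in span(psi)
print(theta_star, np.abs(h - h_theta).max())
```

The defining property is orthogonality: the residual h^\theta - h is orthogonal (in L^2(\pi)) to every basis function.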
24 Value function approximation: Hilbert space setting
Define, for two functions f, g on X: \langle f, g \rangle_\pi = \pi(fg)
25 (cont.) L^2 norm: \|f - g\|_\pi^2 := \langle f-g, f-g \rangle_\pi = E_\pi[(f(X(0)) - g(X(0)))^2]
26 (cont.) Resolvent: R_\gamma = \sum_{t=0}^{\infty} (1+\gamma)^{-(t+1)} P^t
27 (cont.) b = E_\pi[h(X)\psi(X)] = \langle R_\gamma c, \psi \rangle_\pi
28 (cont.) Adjoint: \langle R_\gamma f, g \rangle_\pi = \langle f, R_\gamma^\dagger g \rangle_\pi, \quad f, g \in L^2(\pi)
29 Representing the adjoint
For the stationary process (discounted cost):
\langle R_\gamma f, g \rangle_\pi = E\Big[ \Big( \sum_{t=0}^{\infty} (1+\gamma)^{-(t+1)} P^t f(X(0)) \Big) g(X(0)) \Big]
30 (cont.) Smoothing: E[P^t f(X(0))\, g(X(0))] = E[f(X(t))\, g(X(0))] = E[f(X(0))\, g(X(-t))]
31 (cont.) Hence \langle R_\gamma f, g \rangle_\pi = \sum_{t=0}^{\infty} (1+\gamma)^{-(t+1)} E[f(X(0))\, g(X(-t))]
32 (cont.) = \langle f, R_\gamma^\dagger g \rangle_\pi
33 The adjoint is defined in terms of the time-reversed process:
R_\gamma^\dagger g(x) = \sum_{t=0}^{\infty} (1+\gamma)^{-(t+1)} E[g(X(-t)) \mid X(0) = x]
Causal, hence computable!
34 TD Learning for discounted cost
\tfrac{1}{2} \nabla_\theta \|h^\theta - h\|_\pi^2 = E_\pi[(h^\theta(X(k)) - h(X(k)))\, \psi(X(k))]
= \langle h^\theta - R_\gamma c,\ \psi \rangle_\pi
= \langle R_\gamma (R_\gamma^{-1} h^\theta - c),\ \psi \rangle_\pi
= \langle R_\gamma^{-1} h^\theta - c,\ R_\gamma^\dagger \psi \rangle_\pi
35 (cont.) = -\langle d,\ \varphi \rangle_\pi, with notation d, \varphi defined on slide 37 -- causal, hence computable
36 (cont.) E[d(k)\varphi(k+1)] = -\tfrac{1}{2} \nabla_\theta \|h^\theta - h\|_\pi^2, evaluated at \theta = \theta(k)
37 (cont.) Notation:
(i) Temporal differences: d(k) = c(X(k)) + h^{\theta(k)}(X(k+1)) - (1+\gamma)\, h^{\theta(k)}(X(k))
(ii) Eligibility vectors: the \ell_h-dimensional sequence \varphi(k) = \sum_{t=0}^{k} (1+\gamma)^{-(t+1)}\, \psi(X(k-t)), \quad k \ge 1
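Not stated on the slides, but implied by the definition: the eligibility vector obeys the one-step recursion \varphi(k+1) = (1+\gamma)^{-1}\big( \varphi(k) + \psi(X(k+1)) \big), which is what makes it computable online. A small numerical check of this identity; the trajectory and indicator basis are arbitrary illustrations.

```python
import numpy as np

# Check: phi(k) = sum_{t=0}^{k} (1+gamma)^{-(t+1)} psi(X(k-t))
# satisfies the recursion phi(k+1) = (phi(k) + psi(X(k+1))) / (1 + gamma).
gamma = 0.1
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=50)                  # an arbitrary state trajectory
psi = np.eye(3)                                  # indicator basis psi_i(x) = 1{x = i}

def phi_direct(k):
    # Eligibility vector straight from the definition
    return sum((1 + gamma) ** (-(t + 1)) * psi[X[k - t]] for t in range(k + 1))

phi = psi[X[0]] / (1 + gamma)                    # phi(0)
for k in range(1, 50):
    phi = (phi + psi[X[k]]) / (1 + gamma)        # recursive one-step update
    assert np.allclose(phi, phi_direct(k))
print("recursion verified")
```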
38 Algorithms
TD(1) algorithm: \theta(k+1) - \theta(k) = a_k\, d(k)\, \varphi(k+1), \quad k \ge 0
a_k: stochastic approximation gain
39 (cont.) Consistent under mild conditions: \theta(k) \to \theta^* = \Sigma^{-1} b, where b = E_\pi[h(X)\psi(X)], \Sigma = E_\pi[\psi(X)\psi(X)^T]
40 Algorithms
Newton algorithm:
\theta(k+1) = \Big[ \frac{1}{k} \sum_{i=1}^{k} \psi(X(i+1))\, \psi^T(X(i+1)) \Big]^{-1} \Big[ \frac{1}{k} \sum_{i=1}^{k} c(X(i))\, \varphi(i+1) \Big]
Consistent under mild conditions: \theta(k) \to \theta^* = \Sigma^{-1} b, where b = E_\pi[h(X)\psi(X)], \Sigma = E_\pi[\psi(X)\psi(X)^T]
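A batch sketch of the Newton (least-squares) form, under the same illustrative assumptions as above: \Sigma and b are estimated by sample averages along one trajectory and inverted once at the end, rather than at every step.

```python
import numpy as np

# Batch "Newton" estimate on the illustrative chain:
# theta = [ (1/n) sum psi psi^T ]^{-1} [ (1/n) sum c(X(i)) phi(i+1) ]
rng = np.random.default_rng(3)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
c = np.array([1.0, 2.0, 4.0])
gamma = 0.1
psi = np.eye(3)

n = 300_000
Sigma = np.zeros((3, 3))
b = np.zeros(3)
phi = np.zeros(3)
x = 0
for k in range(n):
    x_next = rng.choice(3, p=P[x])
    phi = (phi + psi[x_next]) / (1 + gamma)    # eligibility vector phi(k+1)
    Sigma += np.outer(psi[x_next], psi[x_next])
    b += c[x_next] * phi                       # sample average converging to E_pi[h psi]
    x = x_next

theta = np.linalg.solve(Sigma / n, b / n)
h_true = np.linalg.solve((1 + gamma) * np.eye(3) - P, c)
print(theta, h_true)
```

The eligibility average converges to b because E[c(X(k)) \varphi(k)] = \langle c, R_\gamma^\dagger \psi \rangle_\pi = \langle R_\gamma c, \psi \rangle_\pi, the adjoint identity from the earlier slides.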
41 TD Learning for average cost
Potential matrix: R_0\, g(x) = E_x\big[ \sum_{t=0}^{\sigma_\alpha} g(X(t)) \big], \quad x \in X, where \sigma_\alpha is the hitting time to \alpha \in X
42 (cont.) R_0\, g solves Poisson's equation when g = c - \pi(c)
43 (cont.) Adjoint: R_0^\dagger g(x) = E\big[ \sum_{t=\sigma_\alpha^{[0]}}^{0} g(X(t)) \,\big|\, X(0) = x \big], \quad g \in L^2(\pi)
Backward hitting time: \sigma_\alpha^{[k]} = \max\{ t \le k : X(t) = \alpha \} [CTCN 2007]
44 Value function approximation: Other metrics
Bellman error (discounted cost): E_\pi[(P h^\theta(X) - (1+\gamma) h^\theta(X) + c(X))^2]
45 (cont.) CLT error: \sigma^2(\theta) = E_\pi[(h - h^\theta)^2 - (Ph - P h^\theta)^2], where h solves Poisson's equation [Tadic & M. 2003; Kim & Henderson 2003; CTCN 2007]
46 (cont.) In each case one can construct an adjoint equation to solve for the optimal parameter
47 Outline (repeat of slide 5)
48 Once the optimal CV is identified
[Figure: two-station network with service rates \mu_1, \mu_2, \mu_3, \mu_4 and arrival rate \alpha_3]
Smoothed estimator using the fluid value function: g = h - Ph = -Dh, with h = J
[Figure: sample-average estimates, standard estimator vs. smoothed estimator]
49 Policy Selection
[Figure: two-station network with service rates \mu_1, \mu_2, \mu_3, \mu_4 and arrival rate \alpha_3]
Policy: serve buffer 1 if buffer 4 is empty, or if W_2 \le W_1 - \beta_1 and m_2 Q_2 + m_3 Q_3 \le w_2
[Figure: performance as a function of the safety-stock parameters w; simulation using the smoothed estimator vs. simulation using the standard estimator]
50 Optimal speed scaling
Joint work with Adam Wierman and the students of ECE 555
"Intel's huge bet turns iffy." John Markoff and Steve Lohr, New York Times, September.
"What matters most to the computer designers at Google is not speed, but power, low power, because data centers can consume as much electricity as a city." Dr. Eric Schmidt, CEO of Google
[1] O. S. Unsal and I. Koren, "System-level power-aware design techniques in real-time systems," Proc. IEEE, vol. 91, no. 7.
[2] S. Irani and K. R. Pruhs, "Algorithmic problems in power management," SIGACT News, vol. 36, no. 2.
[3] S. Kaxiras and M. Martonosi, Computer Architecture Techniques for Power-Efficiency. Morgan & Claypool, 2008.
51 Optimal speed scaling
Can we adjust processor power to optimize the delay/power tradeoff?
[Figure: log(dynamic power) vs. log(frequency), with data points for Intel PXA, TCP processing, and Pentium M]
References [1]-[3] as on slide 50.
52 Optimal speed scaling: queueing model
Abstraction: a controlled queue, Q(k+1) - Q(k) = -U(k) + A(k+1)
U(k): processing rate; A(k+1): job arrivals
Minimize the average cost, with a quadratic processing cost for the sake of example
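A minimal simulation sketch of this controlled queue under two heuristic policies. The quadratic cost c(x, u) = x + u^2, the Poisson arrivals, and both policies are assumptions of mine for illustration; neither is the optimal policy from the talk.

```python
import numpy as np

# Sketch of the speed-scaling queue: Q(k+1) = Q(k) - U(k) + A(k+1).
# Assumed cost c(x, u) = x + u^2 (a quadratic processing cost, as in the talk's example).
rng = np.random.default_rng(4)
alpha = 1.0                        # mean arrival rate (Poisson arrivals, an assumption)
n = 100_000

def run(policy):
    Q, total = 0.0, 0.0
    for k in range(n):
        U = min(Q, policy(Q))      # cannot process more work than is present
        total += Q + U * U         # accumulate cost c(Q(k), U(k))
        Q = Q - U + rng.poisson(alpha)
    return total / n               # average-cost estimate

eta_fixed = run(lambda q: 1.2)             # heuristic: constant processing rate
eta_scaled = run(lambda q: np.sqrt(q))     # heuristic: rate grows with backlog
print(eta_fixed, eta_scaled)
```

Scaling the rate with the backlog keeps the queue short while spending on speed only when needed, which is the delay/power tradeoff the speed-scaling problem formalizes.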
53 Optimal speed scaling: fluid model
Abstraction: a controlled fluid model with processing rate \zeta(t) and arrival rate \alpha
Minimize the total cost up to the draining time T_0:
J^*(x) = \min \int_0^{T_0} c(q(t), \zeta(t)) \, dt, \quad q(0) = x
= \alpha x + \frac{(2x + \alpha^2)^{3/2} - \alpha^3}{3}
54 Optimal speed scaling: value functions
[Figure: relative value function h and fluid value function J^* as functions of x]
J^*(x) = \min \int_0^{T_0} c(q(t), \zeta(t)) \, dt, \quad q(0) = x
55 Optimal speed scaling: policies
\phi^{J^*}(x) = \arg\min_{0 \le u \le x} \big[ c(x, u) + P_u J^*(x) \big]
[Figure: stochastic optimal policy, input u vs. queue length x]
J^*(x) = \min \int_0^{T_0} c(q(t), \zeta(t)) \, dt, \quad q(0) = x
56 Optimal speed scaling: TD Learning
[Figure: parameter estimates \theta_1(t), \theta_2(t) vs. t (up to 10^5), for a two-element basis]
57 Outline (repeat of slide 5)
58 Conclusions
It is possible to apply the adjoint for algorithm synthesis in many settings. Many questions remain!
- Variance of Newton and SA recursions
- Parametrizations in networks
- Complexity: workload relaxations (see CTCN and Henderson & M. 2003)
- Policy improvement, approximate DP: Veatch 2004; Mannor, Menache & Shimkin 2005; Moallemi, Kumar & Van Roy 2007
59 Conclusions
It is possible to apply the adjoint for algorithm synthesis in many settings. Many questions remain!
- Variance of Newton and SA recursions
- Parametrizations in networks
- Complexity: workload relaxations
- Policy improvement, approximate DP
See Chapter 11 of Control Techniques for Complex Networks, Cambridge, 2007.
More informationMonte Carlo Methods for Computation and Optimization (048715)
Technion Department of Electrical Engineering Monte Carlo Methods for Computation and Optimization (048715) Lecture Notes Prof. Nahum Shimkin Spring 2015 i PREFACE These lecture notes are intended for
More informationStability of the two queue system
Stability of the two queue system Iain M. MacPhee and Lisa J. Müller University of Durham Department of Mathematical Science Durham, DH1 3LE, UK (e-mail: i.m.macphee@durham.ac.uk, l.j.muller@durham.ac.uk)
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Mark Schmidt University of British Columbia Winter 2018 Last Time: Monte Carlo Methods If we want to approximate expectations of random functions, E[g(x)] = g(x)p(x) or E[g(x)]
More informationApproximate Dynamic Programming
Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.
More informationRare Event Simulation using Importance Sampling and Cross-Entropy p. 1
Rare Event Simulation using Importance Sampling and Cross-Entropy Caitlin James Supervisor: Phil Pollett Rare Event Simulation using Importance Sampling and Cross-Entropy p. 1 500 Simulation of a population
More informationON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME
ON THE POLICY IMPROVEMENT ALGORITHM IN CONTINUOUS TIME SAUL D. JACKA AND ALEKSANDAR MIJATOVIĆ Abstract. We develop a general approach to the Policy Improvement Algorithm (PIA) for stochastic control problems
More informationECE7850 Lecture 7. Discrete Time Optimal Control and Dynamic Programming
ECE7850 Lecture 7 Discrete Time Optimal Control and Dynamic Programming Discrete Time Optimal control Problems Short Introduction to Dynamic Programming Connection to Stabilization Problems 1 DT nonlinear
More informationChapter 3 Balance equations, birth-death processes, continuous Markov Chains
Chapter 3 Balance equations, birth-death processes, continuous Markov Chains Ioannis Glaropoulos November 4, 2012 1 Exercise 3.2 Consider a birth-death process with 3 states, where the transition rate
More informationStein s Method for Steady-State Approximations: A Toolbox of Techniques
Stein s Method for Steady-State Approximations: A Toolbox of Techniques Anton Braverman Based on joint work with Jim Dai (Cornell), Itai Gurvich (Cornell), and Junfei Huang (CUHK). November 17, 2017 Outline
More informationState-dependent Limited Processor Sharing - Approximations and Optimal Control
State-dependent Limited Processor Sharing - Approximations and Optimal Control Varun Gupta University of Chicago Joint work with: Jiheng Zhang (HKUST) State-dependent Limited Processor Sharing Processor
More informationSequential Monte Carlo Methods for Bayesian Computation
Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter
More informationStochastic Gradient Descent in Continuous Time
Stochastic Gradient Descent in Continuous Time Justin Sirignano University of Illinois at Urbana Champaign with Konstantinos Spiliopoulos (Boston University) 1 / 27 We consider a diffusion X t X = R m
More informationUNIVERSITY OF MANITOBA
Question Points Score INSTRUCTIONS TO STUDENTS: This is a 6 hour examination. No extra time will be given. No texts, notes, or other aids are permitted. There are no calculators, cellphones or electronic
More informationCourse Summary Math 211
Course Summary Math 211 table of contents I. Functions of several variables. II. R n. III. Derivatives. IV. Taylor s Theorem. V. Differential Geometry. VI. Applications. 1. Best affine approximations.
More informationIn search of sensitivity in network optimization
In search of sensitivity in network optimization Mike Chen, Charuhas Pandit, and Sean Meyn June 13, 2002 Abstract This paper concerns policy synthesis in large queuing networks. The results provide answers
More informationMarkov processes and queueing networks
Inria September 22, 2015 Outline Poisson processes Markov jump processes Some queueing networks The Poisson distribution (Siméon-Denis Poisson, 1781-1840) { } e λ λ n n! As prevalent as Gaussian distribution
More informationStochastic process. X, a series of random variables indexed by t
Stochastic process X, a series of random variables indexed by t X={X(t), t 0} is a continuous time stochastic process X={X(t), t=0,1, } is a discrete time stochastic process X(t) is the state at time t,
More informationminimize x subject to (x 2)(x 4) u,
Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for
More informationDynamic Programming Theorems
Dynamic Programming Theorems Prof. Lutz Hendricks Econ720 September 11, 2017 1 / 39 Dynamic Programming Theorems Useful theorems to characterize the solution to a DP problem. There is no reason to remember
More informationChapter 7. Markov chain background. 7.1 Finite state space
Chapter 7 Markov chain background A stochastic process is a family of random variables {X t } indexed by a varaible t which we will think of as time. Time can be discrete or continuous. We will only consider
More informationExercises Stochastic Performance Modelling. Hamilton Institute, Summer 2010
Exercises Stochastic Performance Modelling Hamilton Institute, Summer Instruction Exercise Let X be a non-negative random variable with E[X ]
More informationOutline. A Central Limit Theorem for Truncating Stochastic Algorithms
Outline A Central Limit Theorem for Truncating Stochastic Algorithms Jérôme Lelong http://cermics.enpc.fr/ lelong Tuesday September 5, 6 1 3 4 Jérôme Lelong (CERMICS) Tuesday September 5, 6 1 / 3 Jérôme
More informationOn the Ergodicity Properties of some Adaptive MCMC Algorithms
On the Ergodicity Properties of some Adaptive MCMC Algorithms Christophe Andrieu, Éric Moulines Abstract In this paper we study the ergodicity properties of some adaptive Monte Carlo Markov chain algorithms
More informationReinforcement Learning
Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.
More informationarxiv: v3 [cs.dm] 25 Jun 2016
Approximate optimality with bounded regret in dynamic matching models arxiv:1411.1044v3 [cs.dm] 25 Jun 2016 Ana Bušić Sean Meyn Abstract We consider a discrete-time bipartite matching model with random
More information