Average-cost temporal difference learning and adaptive control variates


1 Average-cost temporal difference learning and adaptive control variates
Sean Meyn, Department of ECE and the Coordinated Science Laboratory
Joint work with S. Mannor (McGill), V. Tadic (Sheffield), and S. Henderson (Cornell)
Work supported in part by NSF Grants.

2 Outline
- Value function approximation
- Performance evaluation
- Performance improvement
- Simulation
- TD learning
- Numerics
- Conclusions

3 References
Chapter 11 of Control Techniques for Complex Networks, Cambridge University Press, 2007. Preprints available online.
[Figure: network schematic with labels rate, control, scheduling, congestion, routing, information, workload release, link outage, breakdown, demand, re-work.]

4 Outline (recap; next: Value function approximation)

5 Average cost
X = {X(t) : t = 0, 1, 2, ...}: ergodic Markov chain on a countable state space X
P(x, y): transition matrix
π: invariant probability measure, πP = π
c : X → R₊: cost function (near monotone)
Finite average cost:
    η = π(c) = lim_{n→∞} (1/n) Σ_{t=0}^{n−1} c(X(t))
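To make the definitions concrete, a minimal numpy sketch, assuming a made-up 3-state chain and cost (all values illustrative):

```python
import numpy as np

# Hypothetical 3-state ergodic chain and cost (illustration only).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
c = np.array([0.0, 1.0, 3.0])

# Invariant measure pi: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
eta = pi @ c                                  # average cost eta = pi(c)

# The ergodic average along a sample path converges to eta.
rng = np.random.default_rng(0)
x, total, n = 0, 0.0, 100_000
for _ in range(n):
    total += c[x]
    x = rng.choice(3, p=P[x])
print(eta, total / n)
```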

6-7 Value functions
Discounted:
    h(x) := E[ Σ_{t=0}^∞ (1+γ)^{−t−1} c(X(t)) | X(0) = x ]
Relative:
    h(x) := E[ Σ_{t=0}^{τ₀} (c(X(t)) − η) | X(0) = x ],   τ₀ = first hitting time to the state 0
Dynamic programming equations, with D = P − I:
Discounted:  [γI − D] h = c
Relative:    D h = −c + η   (Poisson's equation)
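On a finite chain Poisson's equation is a linear system; a sketch reusing the hypothetical chain above, pinning the free additive constant with h(0) = 0:

```python
import numpy as np

# Same hypothetical chain as above.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
c = np.array([0.0, 1.0, 3.0])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))]); pi /= pi.sum()
eta = pi @ c

# Poisson's equation D h = -c + eta with D = P - I, i.e. (I - P) h = c - eta.
# The solution is unique only up to an additive constant, so replace one
# (redundant) equation with the normalization h(0) = 0.
A = np.eye(3) - P
A[0] = [1.0, 0.0, 0.0]
rhs = c - eta
rhs[0] = 0.0
h = np.linalg.solve(A, rhs)
print(np.allclose((P - np.eye(3)) @ h, -c + eta))   # True
```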

8-9 Value function approximation
Discounted-cost value function:
    h(x) = E[ Σ_{t=0}^∞ (1+γ)^{−t−1} c(X(t)) ]
Network model, with c(x) a norm on X. Under general conditions h is approximated by the scaled cost,
    lim_{r→∞} (1/r) h(rx) = (1/γ) c(x),
and by the fluid value function,
    h(x) ≈ J_γ(x) = ∫₀^∞ e^{−γt} c(q(t)) dt,   q(0) = x.

10-12 Value function approximation
Relative value function:
    h(x) = E[ Σ_{t=0}^{τ₀} (c(X(t)) − η) ]
h solves Poisson's equation, D h = −c + η, and is approximated by the fluid value function,
    J(x) = ∫₀^∞ c(q(t)) dt.
Under general conditions,
    lim_{r→∞} (1/r²) h(rx) = J(x).
[Figure: workload model; level sets {h(x) = h₀} of the relative value function, with infeasible region R.]

13-14 Value function approximation: Simulation
Standard Monte Carlo estimator:
    η_k = k^{−1} Σ_{t=0}^{k−1} c(X(t))
CLT variance:
    σ² = π(h² − (Ph)²),   where h solves Poisson's equation D h = −c + η.
Example: simulating the single queue, the CLT variance is O(1/(1−ρ)⁴).
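A sketch of the standard estimator on a reflected random walk (a standard discrete-time single-queue model; parameter values assumed), illustrating how the spread across runs grows as ρ → 1:

```python
import numpy as np

def mc_average_cost(alpha, n_steps, seed=0):
    """Standard Monte Carlo estimate of the mean queue length, c(x) = x,
    for the reflected walk X(t+1) = max(X(t) + D, 0) with
    D = +1 w.p. alpha, -1 w.p. 1 - alpha (load rho = alpha / (1 - alpha))."""
    rng = np.random.default_rng(seed)
    x, total = 0, 0.0
    for d in np.where(rng.random(n_steps) < alpha, 1, -1):
        total += x
        x = max(x + d, 0)
    return total / n_steps

# Spread of the estimator across independent runs blows up as rho -> 1.
for alpha in (0.40, 0.45, 0.475):
    ests = [mc_average_cost(alpha, 100_000, seed=s) for s in range(10)]
    print(alpha, round(np.mean(ests), 2), round(np.std(ests), 2))
```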

15 Value function approximation: Simulation
Example: the single queue with ρ = 0.9.
[Figure: histogram of the estimates n^{−1} Σ_{t=1}^{n} Q(t), n = 100,000, compared with the Gaussian density predicted by the CLT; see Ch. 11 of CTCN.]
CLT variance = O(1/(1−ρ)⁴).

16 Value function approximation: Simulation
Null LDP rate function: the estimator η_n = n^{−1} Σ_{t=0}^{n−1} Q(t) satisfies
    lim_{n→∞} n^{−1} log P{η_n ≥ r} = 0,   r > η.
Example: simulating the single queue, the CLT variance is O(1/(1−ρ)⁴).

17 Value function approximation: Simulation
Control variates: choose g : X → R with known mean π(g) = 0.
Smoothed estimator:
    η_k(ϑ) = k^{−1} Σ_{t=0}^{k−1} ( c(X(t)) − ϑ g(X(t)) ),   k ≥ 1
Perfect control variate: g = h − Ph = −Dh = c − η gives zero CLT variance.
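A sketch of the smoothed estimator on the same reflected walk, taking h to be the walk's fluid value function x²/(2(1 − 2α)) (a surrogate chosen for this illustration). Any h with π(|h|) < ∞ gives a mean-zero control variate g = h − Ph, since E_π[h(X(t+1))] = E_π[h(X(t))]:

```python
import numpy as np

alpha = 0.45                                   # load rho = alpha/(1-alpha)
h = lambda x: x**2 / (2 * (1 - 2 * alpha))     # fluid value function surrogate

def Ph(x):
    # One-step conditional expectation of h for the reflected walk.
    return alpha * h(x + 1) + (1 - alpha) * h(max(x - 1, 0))

def estimate(n_steps, theta, seed=0):
    """Smoothed estimator with control variate g = h - Ph; pi(g) = 0."""
    rng = np.random.default_rng(seed)
    x, total = 0, 0.0
    for _ in range(n_steps):
        g = h(x) - Ph(x)
        total += x - theta * g                 # c(x) = x
        x = max(x + (1 if rng.random() < alpha else -1), 0)
    return total / n_steps

std = [estimate(100_000, theta=0.0, seed=s) for s in range(10)]
smo = [estimate(100_000, theta=1.0, seed=s) for s in range(10)]
print(np.std(std), np.std(smo))                # smoothed spread is far smaller
```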

18-19 Steady-state customer population
Smoothed estimator using the fluid value function: g = h − Ph = −Dh, with h = Ĵ.
[Figure: running estimates of the steady-state customer population, standard vs. smoothed estimator.]
Performance as a function of safety stocks w:
[Figure: estimated performance vs. safety-stock levels, (i) using the smoothed estimator, (ii) using the standard estimator.]

20 Outline (recap; next: Performance improvement and TD learning)

21 Value function approximation
Linear parametrization:
    h^θ(x) := θᵀψ(x) = Σ_{i=1}^{ℓ_h} θᵢ ψᵢ(x),   x ∈ X
How do we find the best parameter θ? And how should "best" be defined?

22-24 Value function approximation: Hilbert space setting
Linear parametrization: h^θ(x) := θᵀψ(x) = Σ_{i=1}^{ℓ_h} θᵢ ψᵢ(x).
Minimize the mean-square error,
    ‖h^θ − h‖²_π = E_π[ (h^θ(X(t)) − h(X(t)))² ]
with gradient
    ∇_θ ‖h^θ − h‖²_π = 2 E_π[ (h^θ(X(t)) − h(X(t))) ψ(X(t)) ].
Setting the gradient to zero gives the optimal parameter,
    θ* = Σ^{−1} b,   b = E_π[h(X)ψ(X)],   Σ = E_π[ψ(X)ψ(X)ᵀ].
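For a finite chain where π and h are computable, θ* can be evaluated in closed form; a sketch with a hypothetical basis ψ(x) = (x, x²):

```python
import numpy as np

# Same hypothetical chain; pi and h computed as in the sketches above.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
c = np.array([0.0, 1.0, 3.0])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))]); pi /= pi.sum()
eta = pi @ c
A = np.eye(3) - P; A[0] = [1.0, 0.0, 0.0]
rhs = c - eta; rhs[0] = 0.0
h = np.linalg.solve(A, rhs)                 # relative value function

# Hypothetical basis psi(x) = (x, x^2) on states {0, 1, 2}.
x = np.arange(3, dtype=float)
Psi = np.column_stack([x, x**2])

Sigma = Psi.T @ np.diag(pi) @ Psi           # E_pi[psi psi^T]
b = Psi.T @ (pi * h)                        # E_pi[h psi]
theta_star = np.linalg.solve(Sigma, b)
print(theta_star)                           # best fit of h in span(psi), L2(pi)
```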

25-29 Value function approximation: Hilbert space setting
For functions f, g on X define the inner product
    ⟨f, g⟩_π = π(fg)
with associated L₂ norm,
    ‖f − g‖²_π := ⟨f − g, f − g⟩_π = E_π[ (f(X(0)) − g(X(0)))² ].
Resolvent:
    R_γ = Σ_{t=0}^∞ (1+γ)^{−(t+1)} P^t
so that b = E_π[h(X)ψ(X)] = ⟨R_γ c, ψ⟩_π.
Adjoint R_γ^†:
    ⟨R_γ f, g⟩_π = ⟨f, R_γ^† g⟩_π,   f, g ∈ L₂(π).
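On a finite chain the resolvent has the closed form R_γ = [(1+γ)I − P]^{−1}, and the adjoint with respect to ⟨·, ·⟩_π is Π^{−1} R_γᵀ Π with Π = diag(π); a quick numerical check (illustrative chain as before):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))]); pi /= pi.sum()

gamma = 0.1
# Resolvent: sum_{t>=0} (1+gamma)^{-(t+1)} P^t = [(1+gamma) I - P]^{-1}.
R = np.linalg.inv((1 + gamma) * np.eye(3) - P)

# Adjoint w.r.t. <f, g>_pi = sum_x pi(x) f(x) g(x):  R_adj = D^{-1} R^T D.
D = np.diag(pi)
R_adj = np.linalg.inv(D) @ R.T @ D

f = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, 0.0, 1.0])
print(np.isclose((R @ f) @ D @ g, f @ D @ (R_adj @ g)))   # True
```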

30-34 Representing the adjoint
For the stationary process (discounted cost),
    ⟨R_γ f, g⟩_π = E[ Σ_{t=0}^∞ (1+γ)^{−t−1} (P^t f)(X(0)) g(X(0)) ].
Smoothing:
    E[(P^t f)(X(0)) g(X(0))] = E[f(X(t)) g(X(0))] = E[f(X(0)) g(X(−t))]
so that
    ⟨R_γ f, g⟩_π = Σ_{t=0}^∞ (1+γ)^{−t−1} E[ f(X(0)) g(X(−t)) ] = ⟨f, R_γ^† g⟩_π.
The adjoint is defined in terms of the time-reversed process,
    R_γ^† g(x) = Σ_{t=0}^∞ (1+γ)^{−t−1} E[ g(X(−t)) | X(0) = x ]
Causal, hence computable!

35-37 TD learning for discounted cost
The gradient is causal, hence computable:
    ½ ∇_θ ‖h^θ − h‖²_π = E_π[ (h^θ(X(t)) − h(X(t))) ψ(X(t)) ]
                       = ⟨h^θ − R_γ c, ψ⟩_π
                       = ⟨R_γ (R_γ^{−1} h^θ − c), ψ⟩_π
                       = ⟨R_γ^{−1} h^θ − c, R_γ^† ψ⟩_π
                       = ⟨d, ϕ⟩_π
and is estimated along the sample path: E[d(k)ϕ(k+1)] ≈ −½ ∇_θ ‖h^θ − h‖²_π at θ = θ(k), where
(i) the temporal differences are defined by
    d(k) = −[ (1+γ) h^{θ(k)}(X(k)) − h^{θ(k)}(X(k+1)) − c(X(k)) ]
(ii) the eligibility vectors are the sequence of ℓ_h-dimensional vectors
    ϕ(k) = Σ_{t=0}^{k} (1+γ)^{−t−1} ψ(X(k−t)),   k ≥ 1.

38-39 Algorithms: TD(1)
    θ(k+1) − θ(k) = a_k d(k) ϕ(k+1),   k ≥ 0
with a_k a stochastic-approximation gain. Consistent under mild conditions:
    θ(k) → θ* = Σ^{−1} b,   b = E_π[h(X)ψ(X)],   Σ = E_π[ψ(X)ψ(X)ᵀ].
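A minimal TD(1) sketch for the reflected walk, with basis ψ(x) = (x, x²); the model, gains, and run length are illustrative assumptions (TD(1) can be sensitive to gain tuning with unbounded basis functions):

```python
import numpy as np

# Illustrative setup (assumed, not the talk's experiment): reflected walk
# with up-probability alpha, c(x) = x, basis psi(x) = (x, x^2).
alpha, gamma = 0.4, 0.5
psi = lambda x: np.array([x, x * x], dtype=float)

rng = np.random.default_rng(0)
theta = np.zeros(2)
phi = np.zeros(2)                         # eligibility vector
x = 0
for k in range(500_000):
    x_next = max(x + (1 if rng.random() < alpha else -1), 0)
    # phi(k+1) = (1+gamma)^{-1} (phi(k) + psi(X(k+1)))  -- telescoped sum
    phi = (phi + psi(x_next)) / (1 + gamma)
    # Temporal difference d(k), c(x) = x:
    d = -((1 + gamma) * theta @ psi(x) - theta @ psi(x_next) - x)
    a = 0.01 / (1 + k / 1000)             # small diminishing SA gain
    theta = theta + a * d * phi
    x = x_next
print(theta)   # approximate best L2(pi) fit of the discounted value function
```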

40 Algorithms: Newton
    θ(k+1) = [ (1/k) Σ_{i=1}^{k} ψ(X(i+1)) ψ(X(i+1))ᵀ ]^{−1} [ (1/k) Σ_{i=1}^{k} c(X(i)) ϕ(i+1) ]
Consistent under mild conditions:
    θ(k) → θ* = Σ^{−1} b,   b = E_π[h(X)ψ(X)],   Σ = E_π[ψ(X)ψ(X)ᵀ].
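The corresponding batch (Newton) estimate from a single trajectory, same illustrative setup as the TD(1) sketch:

```python
import numpy as np

alpha, gamma = 0.4, 0.5
psi = lambda x: np.array([x, x * x], dtype=float)

rng = np.random.default_rng(1)
N = 500_000
Sigma_hat = np.zeros((2, 2))
b_hat = np.zeros(2)
phi = np.zeros(2)
x = 0
for _ in range(N):
    x_next = max(x + (1 if rng.random() < alpha else -1), 0)
    phi = (phi + psi(x_next)) / (1 + gamma)       # eligibility vector
    Sigma_hat += np.outer(psi(x_next), psi(x_next))
    b_hat += x * phi                              # c(X(i)) phi(i+1), c(x) = x
    x = x_next
print(np.linalg.solve(Sigma_hat / N, b_hat / N))  # theta(N)
```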

41-43 TD learning for average cost
Potential matrix:
    R₀ g(x) = E_x[ Σ_{t=0}^{σ_α} g(X(t)) ],   x ∈ X,   σ_α = hitting time to a fixed state α ∈ X.
R₀ solves Poisson's equation when applied to the centered cost, g = c − π(c).
Adjoint, for the stationary process:
    R₀^† g(x) = E[ Σ_{σ_α^{[0]} ≤ t ≤ 0} g(X(t)) | X(0) = x ],   g ∈ L₂(π)
where σ_α^{[k]} = max{ t ≤ k : X(t) = α } is the backward hitting time to α.

44-46 Value function approximation: Other metrics
Bellman error (discounted cost):
    E_π[ (Ph^θ(X) − (1+γ)h^θ(X) + c(X))² ]
CLT error (h solves Poisson's equation):
    σ²(θ) = E_π[ (h − h^θ)² − (Ph − Ph^θ)² ]
In each case an adjoint equation can be constructed to solve for the optimal parameter.

47 Outline (recap; next: Numerics)

48 Simulating the M/M/1 queue
    X(t+1) = [X(t) + Δ(t+1)]₊,   X(0) = x ∈ Z₊
The increment process Δ is i.i.d.:
    Δ(t) = −1 with prob. p = (δ + n)/(n + 1);   Δ(t) = n with prob. 1 − p
so that E[Δ(t)] = −δ and Var(Δ(t)) = O(n). Two special cases, each with δ = 1/10:
1. The M/M/1 queue, n = 1
2. A bursty queue, n = 4
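A direct transcription of this model into a simulation sketch (run length and seed are assumptions):

```python
import numpy as np

def simulate_queue(delta=0.1, n=1, n_steps=100_000, seed=0):
    """X(t+1) = max(X(t) + D(t+1), 0), with D = -1 w.p. p = (delta+n)/(n+1)
    and D = +n otherwise, so that E[D] = -delta and Var(D) = O(n)."""
    rng = np.random.default_rng(seed)
    p = (delta + n) / (n + 1)
    x, path = 0, np.empty(n_steps, dtype=np.int64)
    for t in range(n_steps):
        path[t] = x
        x = max(x + (-1 if rng.random() < p else n), 0)
    return path

for n in (1, 4):                       # M/M/1-like vs. bursty increments
    print(n, simulate_queue(n=n).mean())
```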

49-51 Simulating the M/M/1 queue
Smoothed estimator with adaptive control variate:
    (1/N) Σ_{t=0}^{N−1} [ X(t) − g^{θ_t}(X(t)) ],   g^θ = h^θ − Ph^θ,   h^θ(x) = θ₁x + θ₂x²
[Figure: trajectories of the estimates θ₁, θ₂ for loads ρ = 0.8 and ρ = 0.9.]
State weighting for variance reduction:
    ‖f − g‖²_{π,Ω} := E[ (f(X) − g(X))² / Ω(X) ]
Quartic weighting results in much lower variance:
    Ω(x) = [1 + (μ − α)x]⁴
[Figure: trajectories of θ₁, θ₂ under the quartic weighting.]

52 Outline (recap; next: Conclusions)

53 Conclusions
Value function approximation: as fundamental as the Kalman filter.
Many questions remain:
- Variance of the Newton and SA recursions
- Parametrizations in networks
- Policy improvement, approximate dynamic programming
- Other applications, such as MCMC (very recent work by Van Roy et al.)
