Quasi Stochastic Approximation
American Control Conference, San Francisco, June 2011
Sean P. Meyn
Joint work with Darshan Shirodkar and Prashant Mehta
Coordinated Science Laboratory and the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA
Thanks to NSF & AFOSR
Outline
1. Background: Stochastic Approximation
2. Background: Q-learning
3. Quasi-Stochastic Approximation
4. Conclusions
Background: Stochastic Approximation
Robbins and Monro

Setting: solve the equation h̄(ϑ) = 0, where h̄(ϑ) = E[h(ϑ, ζ)] and ζ is random.

Robbins and Monro [5]: fixed-point iteration with noisy measurements,
  ϑ_{n+1} = ϑ_n + a_n h(ϑ_n, ζ_n),   n ≥ 0,   ϑ_0 ∈ R^d given.

Typical assumptions for convergence:
- Step size a_n = (1 + n)^{-1}.
- The random sequence {ζ_n} is i.i.d., with each ζ_n distributed as ζ.
- Global asymptotic stability of the ODE d/dt θ = h̄(θ).

Excellent recent reference: Borkar 2008 [2].
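A minimal numerical sketch of the Robbins-Monro recursion (not from the talk; the toy problem and all names are illustrative), in Python:

import numpy as np

def robbins_monro(h, theta0, n_iter, rng):
    """Iterate theta_{n+1} = theta_n + a_n h(theta_n, zeta_n) with a_n = 1/(n+1)."""
    theta = theta0
    for n in range(n_iter):
        a_n = 1.0 / (n + 1)
        zeta = rng.standard_normal()        # i.i.d. noise, distributed as zeta
        theta = theta + a_n * h(theta, zeta)
    return theta

# Toy problem: h(theta, zeta) = (2.0 + zeta) - theta, so hbar(theta) = 2 - theta
# and the unique root is theta* = 2.
rng = np.random.default_rng(0)
print(robbins_monro(lambda th, z: (2.0 + z) - th, theta0=0.0, n_iter=10_000, rng=rng))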
Background: Stochastic Approximation
Variance

Linearization of stochastic approximation: assuming convergence to ϑ*, write ϑ̃_n = ϑ_n − ϑ* and
  ϑ̃_{n+1} ≈ ϑ̃_n + a_n [ A ϑ̃_n + Z_n ],
with Z_n = h(ϑ*, ζ_n) (zero mean) and A = ∂h̄(ϑ*), the Jacobian of h̄ at ϑ*.

Central Limit Theorem: √n ϑ̃_n → N(0, Σ_ϑ) in distribution.

This holds under mild conditions. The asymptotic covariance solves a Lyapunov equation, which requires Re λ(A) < −1/2 for every eigenvalue of A:
  (A + ½I) Σ_ϑ + Σ_ϑ (A + ½I)^T + Σ_Z = 0.
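The asymptotic covariance can be computed directly from the Lyapunov equation. A sketch using SciPy (the matrices A and Σ_Z below are illustrative placeholders, not values from the talk):

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])        # Jacobian of hbar at theta*; Re(eig) < -1/2
Sigma_Z = np.array([[1.0, 0.2],
                    [0.2, 0.5]])   # covariance of the noise Z_n

# Solve (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Z = 0.
# solve_continuous_lyapunov(M, Q) returns X with M X + X M^T = Q.
M = A + 0.5 * np.eye(2)
Sigma_theta = solve_continuous_lyapunov(M, -Sigma_Z)
print(Sigma_theta)   # asymptotic covariance of sqrt(n) * (theta_n - theta*)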
Background: Stochastic Approximation
Stochastic Newton-Raphson

Corollary to the CLT. Question: what is the optimal matrix gain Γ_n in
  ϑ_{n+1} = ϑ_n + a_n Γ_n h(ϑ_n, ζ_n),   n ≥ 0,   ϑ_0 ∈ R^d given?

Answer: stochastic Newton-Raphson, so that Γ_n → −A^{-1}.

Proof idea: with this gain the linearization reduces to standard Monte Carlo averaging.
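A small experiment illustrating the effect of the matrix gain, assuming a linear model with known Jacobian A so that the limiting gain −A^{-1} can simply be plugged in (all numbers are illustrative; a true stochastic Newton-Raphson scheme would estimate A online):

import numpy as np

A = np.array([[-1.0, 0.5],
              [0.0, -3.0]])          # Jacobian at theta* (illustrative)
theta_star = np.array([1.0, -2.0])

def run_sa(gain, n_iter=20_000):
    """theta_{n+1} = theta_n + a_n * gain @ h(theta_n, zeta_n), a_n = 1/(n+1)."""
    rng = np.random.default_rng(1)     # same noise sequence for both gains
    theta = np.zeros(2)
    for n in range(n_iter):
        zeta = rng.standard_normal(2)
        h = A @ (theta - theta_star) + zeta
        theta = theta + (gain @ h) / (n + 1)
    return np.linalg.norm(theta - theta_star)

print("identity gain:      ", run_sa(np.eye(2)))
print("Newton-Raphson gain:", run_sa(-np.linalg.inv(A)))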
Background: Stochastic Approximation
Example: Root finding

h(ϑ, ζ) = 1 − tan(ϑ) + ζ, with ζ a normal random variable, ζ ∼ N(0, 9).
Then h̄(ϑ) = 1 − tan(ϑ), so the root is ϑ* = π/4.

Estimates of ϑ* using i.i.d. and deterministic sequences for ζ:
[Figure: two panels of estimates over 5000 iterations; left: ζ i.i.d. Gaussian; right: ζ = Φ^{-1}(·), deterministic.]

Analysis requires the theory of quasi-stochastic approximation...
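A sketch of this experiment. Two assumptions are mine, not the talk's: the deterministic sequence pushes an equidistributed Weyl sequence through Φ^{-1} (one plausible reading of ζ = Φ^{-1}(·)), and a projection onto [0.1, 1.4] keeps tan well behaved:

import numpy as np
from scipy.stats import norm

def sa_tan(zeta_seq, theta0=1.0):
    """theta_{n+1} = theta_n + a_n (1 - tan(theta_n) + zeta_n), a_n = 1/(n+1)."""
    theta = theta0
    for n, z in enumerate(zeta_seq):
        theta = np.clip(theta + (1.0 - np.tan(theta) + z) / (n + 1), 0.1, 1.4)
    return theta

n_iter = 5000
rng = np.random.default_rng(2)
zeta_iid = 3.0 * rng.standard_normal(n_iter)                  # zeta ~ N(0, 9)
u = np.mod(np.arange(1, n_iter + 1) * (np.sqrt(5.0) - 1.0) / 2.0, 1.0)
zeta_det = 3.0 * norm.ppf(u)                                  # deterministic counterpart
print("i.i.d.:", sa_tan(zeta_iid),
      " deterministic:", sa_tan(zeta_det),
      " theta* = pi/4 =", np.pi / 4)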
Background: Q-learning
M&M 2009

System: ẋ = f(x, u); cost: c(x, u).

HJB equation for the discounted-cost optimal control problem:
  min_u { c(x, u) + f(x, u)·∇h*(x) } = γ h*(x).
The term inside the braces is the Q-function:
  Q(x, u) = c(x, u) + f(x, u)·∇h*(x).

Fixed-point equation for the Q-function: writing Q(x) := min_u Q(x, u) = γ h*(x),
  Q(x, u) = c(x, u) + f(x, u)·∇h*(x) = c(x, u) + γ^{-1} f(x, u)·∇Q(x).

Parameterized set of approximations: {Q^ϑ(x, u) : ϑ ∈ R^d}.
Bellman error: E^ϑ(x, u) = γ( Q^ϑ(x, u) − c(x, u) ) − f(x, u)·∇Q^ϑ(x), where Q^ϑ(x) := min_u Q^ϑ(x, u).

Watkins & Dayan, Q-learning, 1992 [6]; Bertsekas & Tsitsiklis, NDP, 1996 [1]; Mehta & Meyn, CDC 2009 [3].
Background: Q-learning
M&M 2009

System: ẋ = f(x, u); cost: c(x, u).
Parameterization: {Q^ϑ(x, u) : ϑ ∈ R^d}.
Bellman error: E^ϑ(x, u) = γ( Q^ϑ(x, u) − c(x, u) ) − f(x, u)·∇Q^ϑ(x).

Model-free form: along a trajectory, f(x, u)·∇Q^ϑ(x) is the time derivative of Q^ϑ(x(t)), so
  E^ϑ(x(t), u(t)) = γ( Q^ϑ(x(t), u(t)) − c(x(t), u(t)) ) − d/dt Q^ϑ(x(t)).

Q-learning of M&M: choose ϑ to minimize the mean-square Bellman error h̄(ϑ) = E[ (E^ϑ(x, u))² ], where ζ = (x, u) is distributed according to the ergodic steady state; equivalently, find a zero of its gradient by stochastic approximation.
Choice of input: stabilizing feedback plus a mixture of sinusoids, u(t) = k(x(t)) + ω(t).
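A sketch showing how the model-free Bellman error can be evaluated along a simulated trajectory. The dynamics are the cubic example used later in the talk; the quadratic parameterization Q^ϑ(x, u) = c(x, u) + ϑ₁x² + ϑ₂xu is purely illustrative and is not the basis from the talk:

import numpy as np

gamma, dt, T = 0.1, 1e-3, 200.0
th1, th2 = 0.5, 0.3                      # a candidate parameter value

def c(x, u):    return 0.5 * (x**2 + u**2)
def Q(x, u):    return c(x, u) + th1 * x**2 + th2 * x * u
def Qmin(x):    # min over u of Q(x, u), attained at u = -th2 * x
    return 0.5 * x**2 + th1 * x**2 - 0.5 * th2**2 * x**2

t = np.arange(0.0, T, dt)
u = np.sin(t) + np.sin(np.pi * t) + np.sin(np.e * t)     # exploration input
x = np.empty_like(t)
x[0] = 0.0
for k in range(len(t) - 1):                              # forward Euler
    x[k + 1] = x[k] + dt * (-x[k]**3 + u[k])

# Model-free form: replace f(x,u) . grad Qmin(x) by d/dt Qmin(x(t)).
dQmin_dt = np.gradient(Qmin(x), dt)
bellman_err = gamma * (Q(x, u) - c(x, u)) - dQmin_dt
print("mean-square Bellman error:", np.mean(bellman_err**2))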
Background: Q-learning
Example: Q-learning

System: ẋ = −x³ + u; cost: c(x, u) = ½(x² + u²).
HJB equation: min_u [ c(x, u) + (−x³ + u) ∇h*(x) ] = γ h*(x).
Basis: Q^ϑ(x, u) = c(x, u) + ϑ₁ x² + ϑ₂ xu/(1 + 2x²).
Control: u(t) = A[ sin(t) + sin(πt) + sin(et) ].

[Figure: two panels comparing the estimated policy with the optimal policy, for a low-amplitude input (left) and a high-amplitude input (right).]

Analysis requires the theory of quasi-stochastic approximation...
Quasi-Stochastic Approximation

Continuous-time, deterministic version of SA:
  d/dt ϑ(t) = a(t) h(ϑ(t), ζ(t)).

Assumptions:
- Ergodicity: ζ satisfies h̄(θ) = lim_{T→∞} (1/T) ∫₀^T h(θ, ζ(t)) dt for all θ ∈ R^d.
- Decreasing gain: a(t) → 0, with ∫₀^∞ a(t) dt = ∞ and ∫₀^∞ a(t)² dt < ∞; usually a(t) = (1 + t)^{-1}.
- Stable ODE: h̄ is globally Lipschitz, and d/dt ϑ = h̄(ϑ) is globally asymptotically stable, with a globally Lipschitz Lyapunov function.
Quasi-Stochastic Approximation
Stability & Convergence

For the QSA ODE d/dt ϑ(t) = a(t) h(ϑ(t), ζ(t)), under the assumptions above (ergodicity, decreasing gain, stable ODE):

Theorem: ϑ(t) → ϑ* as t → ∞, for any initial condition.
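A minimal sketch of the QSA ODE, using Euler integration on the earlier root-finding example, with ζ(t) chosen as a zero-mean mixture of sinusoids (an illustrative choice, not the construction used in the talk):

import numpy as np

def h(theta, zeta):   return 1.0 - np.tan(theta) + zeta
def zeta(t):          return 3.0 * (np.sin(t) + np.sin(np.pi * t))
def a(t):             return 1.0 / (1.0 + t)

dt, T = 0.01, 2000.0
theta = 1.0
for k in range(int(T / dt)):            # forward Euler on d/dt theta = a(t) h(theta, zeta(t))
    t = k * dt
    theta += dt * a(t) * h(theta, zeta(t))
print("theta(T) =", theta, "  target pi/4 =", np.pi / 4)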
Quasi-Stochastic Approximation
Variance, with a(t) = 1/(1 + t)

Assumptions:
- The model is linear: h(θ, ζ) = Aθ + ζ, and every eigenvalue λ of A satisfies Re(λ) < −1.
- ζ has zero mean: lim_{T→∞} (1/T) ∫₀^T ζ(t) dt = 0, at rate 1/T.

The rate of convergence is t^{-1}, and not t^{-1/2}, under these assumptions:

Theorem: For some constant σ < ∞,  lim sup_{t→∞} t ‖ϑ(t) − ϑ*‖ ≤ σ.
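A sketch checking the t^{-1} rate on a scalar linear model with A = −2 (so Re λ(A) < −1) and ζ(t) = sin(t), which averages to zero at rate 1/T; this is an illustrative instance of the assumptions above, with the root shifted to θ* = 1:

import numpy as np

A, theta_star, dt = -2.0, 1.0, 0.01
theta, scaled_err = 0.0, []
for k in range(1, int(2000.0 / dt)):
    t = k * dt
    theta += dt * (A * (theta - theta_star) + np.sin(t)) / (1.0 + t)
    scaled_err.append(t * abs(theta - theta_star))
# If the theorem applies, t * |theta(t) - theta*| stays bounded.
print("max of t*|error| over the last half of the run:",
      max(scaled_err[len(scaled_err) // 2:]))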
Quasi-Stochastic Approximation
Polyak and Juditsky

Polyak and Juditsky [4, 2] obtain the optimal variance for SA using a very simple approach: high gain, plus averaging.

Deterministic P&J: choose a high-gain step size, a(t) = 1/(1 + t)^δ with δ ∈ (0, 1):
  d/dt γ(t) = (1/(1 + t)^δ) h(γ(t), ζ(t)).
The output of this algorithm is then averaged:
  d/dt ϑ(t) = (1/(1 + t)) ( −ϑ(t) + γ(t) ).

Linear convergence under stability alone:
1. There is a finite constant σ satisfying lim sup_{t→∞} t ‖ϑ(t) − ϑ*‖ ≤ σ.
2. Question: is σ minimal?
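A sketch of the deterministic P&J scheme on a scalar linear model. Here A = −0.5 is stable but violates the eigenvalue condition Re λ(A) < −1 of the previous slide, so the rate result for plain QSA does not apply, while the averaged estimate is still expected to achieve the 1/t rate (all values illustrative):

import numpy as np

A, theta_star, delta, dt = -0.5, 1.0, 0.6, 0.01
gamma_est, theta_avg = 0.0, 0.0
for k in range(1, int(2000.0 / dt)):
    t = k * dt
    zeta = np.sin(t)                                   # zero-mean disturbance
    # high-gain stage: d/dt gamma = (1+t)^{-delta} h(gamma, zeta)
    gamma_est += dt * (A * (gamma_est - theta_star) + zeta) / (1.0 + t)**delta
    # averaging stage: d/dt theta = (1+t)^{-1} (-theta + gamma)
    theta_avg += dt * (-theta_avg + gamma_est) / (1.0 + t)
print("t * |error| at t = 2000:", 2000.0 * abs(theta_avg - theta_star))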
Conclusions

Summary:
- QSA is the most natural approach to approximate dynamic programming for deterministic systems, such as Q-learning [3].
- Stability is established using standard techniques.
- Linear convergence (rate 1/t) is obtained under mild stability assumptions.

Current research:
1. Open question: can the approach of Polyak and Juditsky be extended to obtain optimal convergence?
   First we must answer: what is the optimal value of σ?
2. Concentration on applications:
   - Further development of Q-learning.
   - Applications to nonlinear filtering.
References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[2] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
[3] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. In Proc. of the 48th IEEE Conf. on Decision and Control, held jointly with the 2009 28th Chinese Control Conference, pages 3598-3605, Dec. 2009.
[4] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855, 1992.
[5] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
[6] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.