
Quasi Stochastic Approximation
American Control Conference, San Francisco, June 2011
Sean P. Meyn
Joint work with Darshan Shirodkar and Prashant Mehta
Coordinated Science Laboratory and the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA
Thanks to NSF & AFOSR

Outline
1. Background: Stochastic Approximation
2. Background: Q-learning
3. Quasi-Stochastic Approximation
4. Conclusions

Background: Stochastic Approximation
Robbins and Monro

Setting: Solve the equation $\bar h(\vartheta) = 0$, with $\bar h(\vartheta) = \mathsf{E}[h(\vartheta, \zeta)]$, where $\zeta$ is random.

Robbins and Monro [5]: fixed-point iteration with noisy measurements,
$\vartheta_{n+1} = \vartheta_n + a_n h(\vartheta_n, \zeta_n), \quad n \ge 0, \quad \vartheta_0 \in \mathbb{R}^d$ given.

Typical assumptions for convergence:
Step size $a_n = (1+n)^{-1}$
Random sequence $\{\zeta_n\}$ is i.i.d., with each $\zeta_n$ distributed as $\zeta$
Global asymptotic stability of the ODE $\dot\vartheta = \bar h(\vartheta)$

Excellent recent reference: Borkar 2008 [2].
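As a concrete illustration (a sketch of my own, not from the slides), here is a minimal Python implementation of the Robbins-Monro recursion; the function h and the noise model below are made-up examples:

import numpy as np

def robbins_monro(h, theta0, n_iter=5000, rng=None):
    # theta_{n+1} = theta_n + a_n * h(theta_n, zeta_n), with a_n = 1/(1+n)
    rng = rng or np.random.default_rng(0)
    theta = theta0
    for n in range(n_iter):
        a_n = 1.0 / (1.0 + n)
        zeta_n = rng.normal()              # i.i.d., distributed as zeta
        theta = theta + a_n * h(theta, zeta_n)
    return theta

# Example: hbar(theta) = 1 - theta, so theta* = 1; h adds zero-mean noise.
h = lambda theta, zeta: 1.0 - theta + zeta
print(robbins_monro(h, theta0=0.0))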

Background: Stochastic Approximation
Variance: Linearization of Stochastic Approximation

Assuming convergence, write, with $\tilde\vartheta_n = \vartheta_n - \vartheta^*$,
$\tilde\vartheta_{n+1} \approx \tilde\vartheta_n + a_n\big[A\tilde\vartheta_n + Z_n\big]$,
with $Z_n = h(\vartheta^*, \zeta_n)$ (zero mean), and $A = \partial\bar h\,(\vartheta^*)$.

Central Limit Theorem: $\sqrt{n}\,\tilde\vartheta_n \Rightarrow N(0, \Sigma_\vartheta)$

Holds under mild conditions. The Lyapunov equation for the asymptotic variance requires $\mathrm{Re}\,\lambda(A) < -\tfrac{1}{2}$ for every eigenvalue:
$(A + \tfrac{1}{2}I)\Sigma_\vartheta + \Sigma_\vartheta(A + \tfrac{1}{2}I)^{\mathsf T} + \Sigma_Z = 0$
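The asymptotic covariance can be computed numerically from this Lyapunov equation. A short sketch (my own illustration; the matrices A and Sigma_Z are made-up examples):

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])      # must satisfy Re(lambda(A)) < -1/2
Sigma_Z = np.eye(2)              # covariance of the noise Z_n (assumed)

# Solve (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Z = 0
M = A + 0.5 * np.eye(2)
Sigma_theta = solve_continuous_lyapunov(M, -Sigma_Z)
print(Sigma_theta)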

Background: Stochastic Approximation
Stochastic Newton-Raphson (a corollary to the CLT)

Question: What is the optimal matrix gain $\Gamma_n$ in
$\vartheta_{n+1} = \vartheta_n + a_n \Gamma_n h(\vartheta_n, \zeta_n), \quad n \ge 0, \quad \vartheta_0 \in \mathbb{R}^d$ given?

Answer: Stochastic Newton-Raphson, so that $\Gamma_n \approx -A^{-1}$, with $A = \partial\bar h\,(\vartheta^*)$.

Proof: the linearization reduces to standard Monte-Carlo.
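A minimal sketch (my own, with made-up problem data) contrasting the identity gain with the Newton-Raphson gain $\Gamma = -A^{-1}$ on a linear model:

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[-0.6, 0.2],
              [0.0, -0.8]])                  # hbar(theta) = A (theta - theta*)
theta_star = np.array([1.0, -1.0])
Gamma = -np.linalg.inv(A)                    # Newton-Raphson gain

def run(gain, n_iter=20000):
    theta = np.zeros(2)
    for n in range(n_iter):
        a_n = 1.0 / (1.0 + n)
        h_n = A @ (theta - theta_star) + 0.1 * rng.standard_normal(2)
        theta = theta + a_n * gain @ h_n
    return theta

print("identity gain:", run(np.eye(2)))
print("Newton gain  :", run(Gamma))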

Background: Stochastic Approximation
Example: Root finding

$h(\vartheta, \zeta) = 1 - \tan(\vartheta) + \zeta$, with $\zeta$ a normal random variable, $N(0, 9)$.
$\bar h(\vartheta) = 1 - \tan(\vartheta) \implies \vartheta^* = \pi/4$.

Estimates of $\vartheta^*$ using i.i.d. and deterministic sequences for $\zeta$:

[Plots: estimates of $\vartheta^*$ over 5000 iterations; left: $\zeta$ i.i.d. Gaussian, right: $\zeta = \Phi^{-1}(\,\cdot\,)$, deterministic]

Analysis requires theory of quasi-stochastic approximation...
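A Python sketch of this experiment (my own reconstruction): the slides do not say how the deterministic sequence is generated, so the choice of $\Phi^{-1}$ applied to an equidistributed phase, and the projection step, are assumptions:

import numpy as np
from scipy.stats import norm

def estimate(zeta_seq, theta0=0.9):
    theta, path = theta0, []
    for n, zeta in enumerate(zeta_seq):
        a_n = 1.0 / (1.0 + n)
        theta = theta + a_n * (1.0 - np.tan(theta) + zeta)
        theta = np.clip(theta, -1.4, 1.4)   # projection onto (-pi/2, pi/2), added for numerical safety
        path.append(theta)
    return np.array(path)

n_iter = 5000
rng = np.random.default_rng(0)
zeta_iid = 3.0 * rng.standard_normal(n_iter)                   # N(0, 9)
phase = (np.arange(n_iter) * 0.6180339887) % 1.0               # equidistributed phase (assumed)
zeta_det = 3.0 * norm.ppf(np.clip(phase, 1e-6, 1.0 - 1e-6))    # deterministic zeta = Phi^{-1}(.)

print("i.i.d. final estimate       :", estimate(zeta_iid)[-1])
print("deterministic final estimate:", estimate(zeta_det)[-1])
print("target theta* = pi/4        :", np.pi / 4)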

Background: Q-learning
M&M 2009

System: $\dot x = f(x, u)$; cost: $c(x, u)$.

HJB equation for the discounted-cost optimal control problem:
$\min_u \big\{ c(x, u) + f(x, u) \cdot \nabla h^*(x) \big\} = \gamma h^*(x)$

Q-function: $Q(x, u) = c(x, u) + f(x, u) \cdot \nabla h^*(x)$

Fixed-point equation for the Q-function: writing $\underline{Q}(x) := \min_u Q(x, u) = \gamma h^*(x)$,
$Q(x, u) = c(x, u) + f(x, u) \cdot \nabla h^*(x) = c(x, u) + \gamma^{-1} f(x, u) \cdot \nabla \underline{Q}(x)$

Parameterized family of approximations, $\{Q^\vartheta(x, u) : \vartheta \in \mathbb{R}^d\}$
Bellman error: $\mathcal{E}^\vartheta(x, u) = \gamma\big(Q^\vartheta(x, u) - c(x, u)\big) - f(x, u) \cdot \nabla \underline{Q}^\vartheta(x)$

Watkins & Dayan, Q-learning, 1992; Bertsekas & Tsitsiklis, NDP, 1996; Mehta & Meyn, CDC 2009

Background: Q-learning
M&M 2009 (continued)

System: $\dot x = f(x, u)$; cost: $c(x, u)$.
Parameterization $\{Q^\vartheta(x, u) : \vartheta \in \mathbb{R}^d\}$;
Bellman error: $\mathcal{E}^\vartheta(x, u) = \gamma\big(Q^\vartheta(x, u) - c(x, u)\big) - f(x, u) \cdot \nabla \underline{Q}^\vartheta(x)$

Model-free form, evaluated along the trajectory:
$\mathcal{E}^\vartheta(x(t), u(t)) = \gamma\big(Q^\vartheta(x(t), u(t)) - c(x(t), u(t))\big) - \dfrac{d}{dt}\underline{Q}^\vartheta(x(t))$

Q-learning of M&M: minimize the mean-square Bellman error, $\bar h(\vartheta) = \mathsf{E}\big[\mathcal{E}^\vartheta(x, u)^2\big]$, with the expectation taken in the ergodic steady state of $\zeta = (x, u)$.
Choose the input as stable feedback plus a mixture of sinusoids, $u(t) = k(x(t)) + \omega(t)$.
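A hedged sketch of evaluating the model-free Bellman error along a simulated trajectory; the dynamics, cost, feedback, parameterization, and discount factor below are placeholders of my own, and $\frac{d}{dt}\underline{Q}^\vartheta(x(t))$ is approximated by a finite difference:

import numpy as np

gamma, dt = 0.25, 1e-3                      # discount factor and step size (assumed)

def f(x, u):  return -x + u                 # placeholder dynamics (assumed)
def c(x, u):  return 0.5 * (x**2 + u**2)    # placeholder cost (assumed)
def k(x):     return -0.5 * x               # stabilizing feedback (assumed)

def Qtheta(theta, x, u):                    # placeholder parameterization (assumed)
    return c(x, u) + theta[0] * x**2 + theta[1] * x * u

def Qunder(theta, x, ugrid=np.linspace(-3.0, 3.0, 201)):
    return np.min(Qtheta(theta, x, ugrid))  # min over u, by grid search

def bellman_errors(theta, T=10.0):
    x, t, errs = 1.0, 0.0, []
    Qu_prev = Qunder(theta, x)
    for _ in range(int(T / dt)):
        u = k(x) + np.sin(t) + np.sin(np.pi * t)     # feedback + sinusoidal probing
        x += dt * f(x, u)
        t += dt
        Qu = Qunder(theta, x)
        # model-free Bellman error: d/dt Qunder replaced by a finite difference
        errs.append(gamma * (Qtheta(theta, x, u) - c(x, u)) - (Qu - Qu_prev) / dt)
        Qu_prev = Qu
    return np.array(errs)

print("mean-square Bellman error:", np.mean(bellman_errors(np.array([0.4, -0.3]))**2))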

Background: Q-learning
Example: Q-learning

$\dot x = -x^3 + u, \qquad c(x, u) = \tfrac{1}{2}(x^2 + u^2)$

HJB equation: $\min_u \big[ c(x, u) + (-x^3 + u)\, \nabla h^*(x) \big] = \gamma h^*(x)$

Basis: $Q^\vartheta(x, u) = c(x, u) + \vartheta_1 x^2 + \vartheta_2 \dfrac{xu}{1 + 2x^2}$

Control: $u(t) = A[\sin(t) + \sin(\pi t) + \sin(e t)]$

[Plots: estimated policy vs. the optimal policy, for a low-amplitude input (left) and a high-amplitude input (right)]

Analysis requires theory of quasi-stochastic approximation...
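A brute-force sketch of this example (my own, and not the QSA algorithm of the talk): it simulates the system under the sinusoidal input, evaluates the empirical mean-square Bellman error using the closed-form minimizer of $Q^\vartheta$ over $u$, and grid-searches over $\vartheta$. The discount factor, input amplitude, and horizon are made-up values:

import numpy as np

gamma, dt, A_amp = 0.25, 1e-3, 1.0          # discount, step, input amplitude (assumed)

# Simulate xdot = -x^3 + u under u(t) = A[sin(t) + sin(pi t) + sin(e t)].
n = int(50.0 / dt)
ts = np.arange(n) * dt
us = A_amp * (np.sin(ts) + np.sin(np.pi * ts) + np.sin(np.e * ts))
xs = np.empty(n)
x = 0.0
for i in range(n):
    xs[i] = x
    x += dt * (-x**3 + us[i])
cs = 0.5 * (xs**2 + us**2)

def mean_sq_bellman_error(th1, th2):
    # Q^theta(x,u) = c(x,u) + th1*x^2 + th2*x*u/(1+2x^2); minimizing over u gives
    # u* = -th2*x/(1+2x^2), hence Qunder(x) = x^2/2 + th1*x^2 - th2^2*x^2/(2*(1+2x^2)^2).
    Q = cs + th1 * xs**2 + th2 * xs * us / (1.0 + 2.0 * xs**2)
    Qu = 0.5 * xs**2 + th1 * xs**2 - 0.5 * th2**2 * xs**2 / (1.0 + 2.0 * xs**2) ** 2
    dQu = np.diff(Qu) / dt                   # model-free d/dt term (finite difference)
    err = gamma * (Q[1:] - cs[1:]) - dQu
    return np.mean(err**2)

grid = np.linspace(-2.0, 2.0, 41)
best = min((mean_sq_bellman_error(a, b), a, b) for a in grid for b in grid)
print("best (theta1, theta2) on the grid:", best[1:])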

Quasi-Stochastic Approximation

Continuous-time, deterministic version of SA:
$\dfrac{d}{dt}\vartheta(t) = a(t)\, h(\vartheta(t), \zeta(t))$

Assumptions:
Ergodicity: $\zeta$ satisfies
$\bar h(\theta) = \lim_{T \to \infty} \dfrac{1}{T} \int_0^T h(\theta, \zeta(t))\, dt, \quad$ for all $\theta \in \mathbb{R}^d$.
Decreasing gain: $a(t) \downarrow 0$, with $\int_0^\infty a(t)\, dt = \infty$ and $\int_0^\infty a(t)^2\, dt < \infty$; usually take $a(t) = (1 + t)^{-1}$.
Stable ODE: $\bar h$ is globally Lipschitz, and $\dot\vartheta = \bar h(\vartheta)$ is globally asymptotically stable, with a globally Lipschitz Lyapunov function.
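A minimal forward-Euler sketch of this QSA ODE (my own illustration; the function h and the probing signal $\zeta(t)$ are made-up examples):

import numpy as np

dt = 1e-3

def h(theta, zeta):                                  # hbar(theta) = -(theta - 2), so theta* = 2
    return -(theta - 2.0) + zeta

theta, t = 0.0, 0.0
for _ in range(int(200.0 / dt)):
    a_t = 1.0 / (1.0 + t)
    zeta_t = np.sin(t) + np.sin(np.sqrt(2.0) * t)    # zero time-average probing signal
    theta += dt * a_t * h(theta, zeta_t)             # forward-Euler step of the QSA ODE
    t += dt

print("theta(T) =", theta, " (theta* = 2)")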

Quasi-Stochastic Approximation
Stability & Convergence

Under the assumptions above (ergodicity, decreasing gain, stable ODE):

Theorem: $\vartheta(t) \to \vartheta^*$ for any initial condition.

Quasi-Stochastic Approximation
Variance

Take $a(t) = 1/(1 + t)$, and assume:
The model is linear: $h(\theta, \zeta) = A\theta + \zeta$, and each eigenvalue $\lambda(A)$ satisfies $\mathrm{Re}(\lambda) < -1$.
$\zeta$ has zero mean: $\lim_{T \to \infty} \dfrac{1}{T} \int_0^T \zeta(t)\, dt = 0$, at rate $1/T$.

The rate of convergence is then $t^{-1}$, and not $t^{-1/2}$, under these assumptions:

Theorem: For some constant $\sigma < \infty$,
$\limsup_{t \to \infty}\; t\, \|\vartheta(t) - \vartheta^*\| \le \sigma$
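A small numerical check of this $t^{-1}$ rate (my own, with a made-up scalar model satisfying the eigenvalue condition):

import numpy as np

A = -2.0                       # scalar "matrix"; theta* = 0 for h(theta, zeta) = A*theta + zeta
dt = 1e-3
theta, t = 1.0, 0.0
for _ in range(int(1000.0 / dt)):
    zeta = np.cos(t) + np.cos(np.sqrt(3.0) * t)      # zero time-average probing signal
    theta += dt * (1.0 / (1.0 + t)) * (A * theta + zeta)
    t += dt

print("t * |theta(t)| at t = 1000:", t * abs(theta))   # stays bounded, consistent with the theorem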

Quasi-Stochastic Approximation
Polyak and Juditsky

Polyak and Juditsky [4, 2] obtain the optimal variance for SA using a very simple approach: high gain, and averaging.

Deterministic P&J: choose a high-gain step size, $a(t) = 1/(1 + t)^\delta$, with $\delta \in (0, 1)$:
$\dfrac{d}{dt}\gamma(t) = \dfrac{1}{(1 + t)^\delta}\, h(\gamma(t), \zeta(t))$

The output of this algorithm is then averaged:
$\dfrac{d}{dt}\vartheta(t) = \dfrac{1}{1 + t}\big(-\vartheta(t) + \gamma(t)\big)$

Linear convergence under stability alone:
1. There is a finite constant $\sigma$ satisfying $\limsup_{t \to \infty}\; t\, \|\vartheta(t) - \vartheta^*\| \le \sigma$.
2. Question: Is $\sigma$ minimal?
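A forward-Euler sketch of this high-gain-plus-averaging scheme (my own illustration, on a made-up scalar problem; $\delta$ and the probing signal are assumptions):

import numpy as np

dt = 1e-3
delta = 0.6                              # high-gain exponent in (0, 1) (assumed)

def h(g, zeta):                          # hbar(g) = -(g - 2), so theta* = 2
    return -(g - 2.0) + zeta

gamma_est, theta_avg, t = 0.0, 0.0, 0.0
for _ in range(int(500.0 / dt)):
    zeta = np.sin(t) + np.sin(np.sqrt(2.0) * t)                  # zero time-average probing
    gamma_est += dt * h(gamma_est, zeta) / (1.0 + t) ** delta    # high-gain ODE
    theta_avg += dt * (-theta_avg + gamma_est) / (1.0 + t)       # averaging ODE
    t += dt

print("averaged estimate:", theta_avg, " (theta* = 2)")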

Conclusions

Summary: QSA is the most natural approach to approximate dynamic programming for deterministic systems, such as Q-learning [3].
Stability is established using standard techniques.
Linear convergence is obtained under mild stability assumptions.

Current research:
1. Open question: Can we extend the approach of Polyak and Juditsky to obtain optimal convergence?
   First we must answer: what is the optimal value of $\sigma$?
2. Concentration on applications:
   Further development of Q-learning
   Applications to nonlinear filtering

References

[1] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Cambridge, MA, 1996.
[2] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008.
[3] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. In Proc. of the 48th IEEE Conf. on Decision and Control, held jointly with the 2009 28th Chinese Control Conference, pages 3598-3605, Dec. 2009.
[4] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855, 1992.
[5] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
[6] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.