Stochastic Gradient Descent in Continuous Time

Size: px

Start display at page:

Download "Stochastic Gradient Descent in Continuous Time"

Hubert Hood
5 years ago
Views:

1 Stochastic Gradient Descent in Continuous Time Justin Sirignano University of Illinois at Urbana Champaign with Konstantinos Spiliopoulos (Boston University) 1 / 27

2 We consider a diffusion X t X = R m : dx t = f (X t )dt + σdw t. The goal is to statistically estimate a model f (x, θ) for f (x) where θ R n. f (x, θ) and f (x) may be non-convex W t is a standard Brownian motion. The diffusion term W t represents any random behavior of the system or environment. 2 / 27

3 The parameter update satisfies the SDE: dθ t = α t [ θ f (X t ; θ t )(σσ T ) 1 dx t θ f (X t, θ t )(σσ T ) 1 f (X t, θ t )dt ] α t is the learning rate Can be used for both: Statistical estimation given previously observed data Online learning (i.e., statistical estimation in real-time as data becomes available) If m = 1 and σ = 1: dθ t = α t [ θ f (X t ; θ t )dx t θ f (X t, θ t )f (X t, θ t )dt ] 3 / 27

4 Why is Stochastic Gradient Descent in Continuous Time useful? Physics and engineering models are typically in continuous time. It therefore makes sense to also develop the statistical learning updates in continuous time. Continuous-time dynamics are oftentimes much simpler than discrete dynamics at longer time intervals. 4 / 27

5 Although stochastic gradient descent in continuous time must ultimately be discretized for numerical implementation, the continuous-time framework still has significant numerical advantages. Continuous-time stochastic gradient descent allows for the control and reduction of numerical error due to discretization Example 1: Higher-order numerical schemes for numerical solution of SDE Example 2: Non-uniform time step sizes. If convergence is slow, the time step size may be adaptively decreased. In contrast, discrete-time stochastic gradient descent uses fixed discrete steps and cannot do this. 5 / 27

6 Overview of Result Assume X t is ergodic and has a unique invariant measure π(dx). Define: h(θ) = X h(x, θ)π(dx) Define the natural objective function: g(x, θ) = 1 2 f (x, θ) f (x) 2 σσ T We show that lim ḡ(θ t) = 0, almost surely. t 6 / 27

7 Assumptions Assume that 0 α t dt =, 0 αt 2 dt < and that 0 α s ds <. The condition 0 α s ds < follows immediately from the other two restrictions for the learning rate if it is chosen to be a monotonic function of t. A standard choice is α t = 1 C+t for some constant 0 < C <. Polynomial bounds on g and f is Lipschitz (see our paper for details) 7 / 27

8 Related Literature Extensive research on stochastic gradient descent in discrete time. Relatively little research for continuous time Bertsekas and Tsitsiklis (2000) prove convergence of stochastic gradient descent in discrete time in the absence of the X process. The X term introduces correlation across times, and this correlation does not disappear as t Unlike in Bertsekas and Tsitsiklis (2000) where parameter updates are unbiased and noise is i.i.d., the X process causes parameter updates to be biased and correlated across times. This complicates the analysis. 8 / 27

9 ODE method : proves discrete-time stochastic gradient descent converges to the solution of an ODE which itself converges to a limiting point, Kushner and Yin (2003), Benveniste, Metivier and Priouret (2012) Requires the strong assumption that the iterates (i.e., the model parameters which are being learned) remain in a bounded set with probability one. Proving that the iterates remain in a bounded set with probability one can be challenging to show and, moreover, may not necessarily be true for all models. 9 / 27

10 Proof Approach Consider the cycles of random times where for k = 1, 2, 0 = σ 0 τ 1 σ 1 τ 2 σ 2... τ k = inf{t > σ k 1 : ḡ(θ t ) κ} σ k = sup{t > τ k : ḡ(θ τ k ) ḡ(θ s ) 2 ḡ(θ τk ) 2 for all s [τ k, t] and t τ k α s ds λ} The purpose of these random times is to control the periods of time where ḡ(θ ) is close to zero and away from zero. Let us next define the random time intervals I k = [τ k, σ k ) and J k = [σ k 1, τ k ). Notice that for every t J k we have ḡ(θ t ) < κ. 10 / 27

11 Suppose that there are an infinite number of intervals I k = [τ k, σ k ). There is a fixed constant γ = γ(κ) > 0 such that for k large enough, one has Then, ḡ(θ t ). ḡ(θ σk ) ḡ(θ τk ) γ However, ḡ 0. Therefore (by contradiction) there are a finite number of intervals I k. 11 / 27

12 + + + σk ḡ(θ σk ) ḡ(θ τk ) = σk σk τ k σk τ k α s ḡ(θ s ) 2 ds α s ḡ(θs ), θ f (X s, θ s )σ 1 dw s τ k αs 2 [ ] 2 tr ( θ f (X s, θ s )σ 1 )( θ f (X s, θ s )σ 1 ) T θ θ ḡ(θ s ) τ k α s θ ḡ(θ s ), θ ḡ(θ s ) θ g(x s, θ s ) ds ds Recall that 0 α t dt = and 0 α 2 t dt < (Ex: α t = 1 1+t ). 12 / 27

13 Most Difficult Term σk τ k α s θ ḡ(θ s ), θ ḡ(θ s ) θ g(x s, θ s ) ds Rewrite this term using an associated Poisson equation. Assume: X G(x, θ)π(dx) = 0 Let L x be the generator for the X process. Then the Poisson equation L x u(x, θ) = G(x, θ) has a unique solution (with some nice properties). 13 / 27

14 Numerical Examples Ornstein-Uhlenbeck (OU) process Multi-dimensional OU process Burger s equation Reinforcement learning 14 / 27

15 The Ornstein-Uhlenbeck (OU) process X t R satisfies the stochastic differential equation: dx t = c(m X t )dt + dw t. We use continuous stochastic gradient descent to learn the parameters θ = (c, m) R 2. f (x, θ) = c(m x) and f (x) = f (x, θ ) 15 / 27

16 We study 10, 500 cases. For each case, a different θ is randomly generated in the range [1, 2] [1, 2]. For each case, we solve for the parameter θ t over the time period [0, T ] for T = To summarize: For cases n = 1 to 10,500 Generate a random θ in [1, 2] [1, 2] Simulate a single path of X t given θ and simultaneously solve for the path of θ t on [0, T ] 16 / 27

17 Figure: Mean error in percent plotted against time. Time is in log scale. 17 / 27

18 Figure: Mean squared error plotted against time. Time is in log scale. 18 / 27

19 The multidimensional Ornstein-Uhlenbeck process X t R d satisfies the stochastic differential equation: dx t = (M AX t )dt + dw t. We use continuous stochastic gradient descent to learn the parameters θ = (M, A) R d R d d. 19 / 27

20 Figure: Mean error in percent plotted against time. Time is in log scale. 20 / 27

21 Figure: Mean squared error plotted against time. Time is in log scale. 21 / 27

22 The stochastic Burger s equation is: u t (t, x) = θ 2 u u(t, x) u x 2 x (t, x) + σ 2 W (t, x), t x where x [0, 1] and W (t, x) is a Brownian sheet. 22 / 27

23 Error/Time Maximum Error % quantile of error Mean squared error Mean Error in percent Maximum error in percent % quantile of error in percent Table: Error at different times for the estimate θ t of θ across 525 cases. The error is θ n t θ,n where n represents the n-th case. The error in percent is 100 θ n t θ,n / θ,n. 23 / 27

24 Figure: Mean error in percent plotted against time. Time is in log scale. 24 / 27

25 Figure: Mean squared error plotted against time. Time is in log scale. 25 / 27

26 We consider the classic reinforcement learning problem of balancing a pole on a moving cart. The goal is to balance a pole on a cart and to keep the cart from moving outside the boundaries via applying a force of ±10 Newtons. The position x of the cart, the velocity ẋ of the cart, angle of the pole β, and angular velocity β of the pole are observed. The dynamics of (x, ẋ, β, β) satisfy a set of ODEs 26 / 27

27 Reward/Episode Maximum Reward % quantile of reward Mean reward % quantile of reward Minimum reward Table: Reward at the k-th episode across the 525 cases using continuous stochastic gradient descent to learn the model dynamics. 27 / 27

On the fast convergence of random perturbations of the gradient flow.

On the fast convergence of random perturbations of the gradient flow. Wenqing Hu. 1 (Joint work with Chris Junchi Li 2.) 1. Department of Mathematics and Statistics, Missouri S&T. 2. Department of Operations