MS&E338 Reinforcement Learning    Lecture 4 - 04.11.2018

Q-Learning and Stochastic Approximation

Lecturer: Ben Van Roy    Scribes: Christopher Lazarus, Javier Sagastuy

In this lecture we study the convergence of Q-learning updates via stochastic approximation results.

1 Q-learning update

Assume we are given data samples of the form

$$(s_k, a_k, r_{k+1}, s_{k+1}), \qquad k = 0, 1, 2, \ldots$$

and we iteratively apply the following update to the state-action value function:

$$Q_{k+1}(s, a) = \begin{cases} (1 - \gamma_k)\, Q_k(s, a) + \gamma_k \left( r_{k+1} + \max_{a' \in \mathcal{A}} Q_k(s_{k+1}, a') \right) & \text{if } s = s_k,\ a = a_k, \\ Q_k(s, a) & \text{otherwise.} \end{cases}$$

Proposition 1. If each $(s, a) \in \mathcal{S} \times \mathcal{A}$ is sampled infinitely often and the $\{\gamma_k\}$ are deterministic and satisfy, for every $(s, a)$,

$$\sum_{k \,:\, (s_k, a_k) = (s, a)} \gamma_k = \infty, \qquad \sum_{k \,:\, (s_k, a_k) = (s, a)} \gamma_k^2 < \infty,$$

then $Q_k \to Q^*$.

Originally, in class, there were a couple of issues with the formulation of the above proposition: we want each state-action pair to be sampled infinitely often, but if an adversary decides when each pair is sampled, it can try to game the system by choosing to update at particular times. For example:

- The adversary picks a particular state-action pair only at times when $\gamma_k$ is (essentially) zero.
- The adversary spaces out the updates to a pair so that the gaps get bigger and bigger, and along that subsequence the values of $\gamma_k$ plummet quickly.

Requiring the step-size conditions to hold along the subsequence of times at which each pair is actually updated, as in Proposition 1, rules out both strategies.

We can rewrite the update as

$$Q_{k+1}(s_k, a_k) = Q_k(s_k, a_k) + \gamma_k \underbrace{\left( r_{k+1} + \max_{a \in \mathcal{A}} Q_k(s_{k+1}, a) - Q_k(s_k, a_k) \right)}_{\text{temporal difference}}.$$

A temporal difference is a difference between predictions that you can make in consecutive time periods. It is composed of two parts: the current value of the state-action value function and the value proposed by the greedy one-step target. The difference tells us whether we over- or underestimated the previous value of the Q function.

Note that the Q-learning update can be understood as an asynchronous stochastic approximation version of value iteration. Thus, in the following section we study stochastic approximation as a way to understand the convergence properties of Q-learning.
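To make the update concrete, here is a minimal NumPy sketch of a single tabular Q-learning step. This is our illustration rather than code from the course: the function name and array layout are assumptions, and, following the formula above, the step size plays the role of the notes' $\gamma_k$ (no explicit discount factor appears in the target as written).

```python
import numpy as np

def q_learning_step(Q, s, a, r_next, s_next, step_size):
    """One tabular Q-learning update on the sampled pair (s, a).

    Q is a (num_states, num_actions) array and step_size is the notes'
    gamma_k. Only the entry Q[s, a] changes; every other state-action
    pair keeps its current value, exactly as in the case split above.
    """
    # Temporal difference: greedy one-step target minus current estimate.
    td = r_next + np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += step_size * td
    return Q
```

Written this way, the code is literally the temporal-difference form of the update: the new estimate is the old one plus a step in the direction of the temporal difference.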
2 Stochastic Approximation

We are not going to build a comprehensive foundation for proving stochastic approximation results; instead, we will go over the intuition behind how the formal analysis goes. The basic stochastic approximation iteration is

$$x_{k+1} = x_k + \gamma_k\, s(x_k, w_k),$$

where $w_k$ is a random disturbance. Let us assume that $w_k$ is ergodic, and in particular that there is some stationary distribution for $w_k$, so we can talk about expectations under that distribution. If $w$ is drawn from the steady-state distribution, let

$$\bar{s}(x) = E[s(x, w)].$$

Under various technical conditions, when $\gamma_k$ is small the sequence approximates an ODE of the form

$$\dot{x}_t = \bar{s}(x_t).$$

The idea is to set up a correspondence between the continuous trajectory and the discrete sequence:

$$\underbrace{x_{t_k}}_{\text{continuous}} \;\longleftrightarrow\; \underbrace{x_k}_{\text{discrete}}, \qquad t_k = \sum_{i=1}^{k} \gamma_i.$$

The $\gamma_k$'s can be thought of like the $dt$ in an ODE:

$$\frac{x_{t+dt} - x_t}{dt} = \bar{s}(x_t), \qquad \text{i.e.,} \qquad x_{t+dt} = x_t + dt\, \bar{s}(x_t).$$

If $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$, then this update rule converges: as $k$ increases, the iterates follow the tracks of the ODE more and more closely.

Let's look at a simple case: the sample mean of i.i.d. $w_k$ with $E[|w|] < \infty$,

$$x_k = \frac{1}{k} \sum_{i=1}^{k} w_i.$$

Then

$$x_{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} w_i = \frac{1}{k+1}\, w_{k+1} + \left( 1 - \frac{1}{k+1} \right) x_k = (1 - \gamma_k)\, x_k + \gamma_k\, w_{k+1} \quad \text{if we let } \gamma_k = \frac{1}{k+1}.$$

By the law of large numbers, $x_k \to E[w_0]$. In fact, as long as $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$, we get $x_k \to E[w_0]$.
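As a sanity check on this example, the following sketch (our illustration; the distribution and seed are arbitrary) runs the recursion $x_{k+1} = (1 - \gamma_k) x_k + \gamma_k w_{k+1}$ with $\gamma_k = 1/(k+1)$ and confirms that the iterate approaches $E[w_0]$.

```python
import numpy as np

rng = np.random.default_rng(0)

x = 0.0
for k in range(200_000):
    w = rng.normal(loc=3.0, scale=2.0)     # i.i.d. noise with E[w] = 3
    gamma_k = 1.0 / (k + 1)                # sum gamma_k = inf, sum gamma_k^2 < inf
    x = (1.0 - gamma_k) * x + gamma_k * w  # stochastic-approximation form of the sample mean

print(x)  # close to 3.0, as the law of large numbers predicts
```

With this particular step-size schedule, the recursion reproduces the running sample mean exactly; other schedules satisfying the two summability conditions converge to the same limit.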
2.1 Understanding the intuition behind the ODE approximation

We now reason about the same example in a continuous context. Let $h$ be a small increment in the continuous time domain, and consider

$$x_{t+h} = x_t + h\, s(x_t, w_t). \qquad (1)$$

Suppose $N$ is large but $Nh \ll 1$. Then

$$\begin{aligned} x_{t+Nh} &= x_t + h \sum_{n=0}^{N-1} s(x_{t+nh}, w_{t+nh}) && \text{from equation (1), by induction} \\ &\approx x_t + h \sum_{n=0}^{N-1} s(x_t, w_{t+nh}) && \text{assuming } x_t \text{ doesn't change much, since } Nh \text{ is small} \\ &= x_t + hN \cdot \frac{1}{N} \sum_{n=0}^{N-1} s(x_t, w_{t+nh}) && \text{multiplying and dividing by } N \\ &= x_t + dt \left( \bar{s}(x_t) + O\!\left( \tfrac{1}{\sqrt{N}} \right) \right), && \text{with } dt = Nh, \end{aligned}$$

where in the last step the $O(1/\sqrt{N})$ term comes from the fact that the variance of the mean of $N$ i.i.d. samples is on the order of $1/N$.

We will leverage basic martingale convergence theorems and use them to show how proofs of basic stochastic approximation theorems work.

2.2 A prototypical example of a stochastic approximation result

Proposition 2. Suppose the $w_k$ are i.i.d. and there exist $x^*$, $c_1$, $c_2$ such that for all $x$:

1. $(x^* - x)^\top \bar{s}(x) \geq c_1 \|x^* - x\|^2$;
2. $E\left[ \|s(x, w_k)\|^2 \right] \leq c_2 \left( 1 + \|x^* - x\|^2 \right)$.

If, in addition, $\{\gamma_k\}$ is deterministic with $\sum_k \gamma_k = \infty$ and $\sum_k \gamma_k^2 < \infty$, then $x_k \to x^*$ with probability 1.

Intuitively, the two conditions in Proposition 2 can be read as follows:

1. You are going in the right direction, fast enough.
2. The noise contributed by $w_k$ is bounded: it may be large, but not too large. If you are far away, you are allowed large noise, because that is compensated by the fact that you can take big steps.

Theorem 1 (Supermartingale Convergence Theorem). Let $X_k, Y_k, Z_k$, $k = 0, 1, 2, \ldots$, be nonnegative scalar random variables with $\sum_k Y_k < \infty$. If

$$E_k[X_{k+1}] \leq X_k + Y_k - Z_k,$$

then, with probability 1, $\lim_{k \to \infty} X_k$ exists and is finite, and $\sum_k Z_k < \infty$.

In the above statement, the subscript $k$ on the expectation denotes conditioning on $X_0, \ldots, X_k$. Note that Theorem 1 is a stochastic generalization of a convergence result from real analysis.
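Before using Theorem 1 to prove Proposition 2, here is a small simulation of the ODE correspondence from Section 2.1 (our construction; the drift $\bar{s}(x) = -(x - 2)$ and the step-size schedule are arbitrary choices). The discrete iterate $x_k$ tracks the solution of $\dot{x}_t = \bar{s}(x_t)$ evaluated at the accumulated time $t_k = \sum_{i=1}^{k} \gamma_i$.

```python
import numpy as np

rng = np.random.default_rng(1)

# s(x, w) = -(x - 2) + w, so s_bar(x) = -(x - 2); the ODE dx/dt = s_bar(x)
# started at x_0 has the closed-form solution x_t = 2 + (x_0 - 2) * exp(-t).
x0 = 5.0
x, t = x0, 0.0
for k in range(1, 100_000):
    gamma_k = 1.0 / k                           # sum gamma_k = inf, sum gamma_k^2 < inf
    x += gamma_k * (-(x - 2.0) + rng.normal())  # discrete stochastic-approximation step
    t += gamma_k                                # accumulated ODE time t_k

ode_x = 2.0 + (x0 - 2.0) * np.exp(-t)
print(x, ode_x)  # the iterate stays close to the ODE trajectory at time t_k
```

This same drift also satisfies the two conditions of Proposition 2 with $x^* = 2$ and $c_1 = 1$, so the convergence seen here is an instance of what the proposition guarantees.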
We will not prove Theorem 1 here (it is pretty hard), but we will use it to prove Proposition 2.

Proof. Let $d_k = \|x_k - x^*\|^2$. Then

$$d_{k+1} = \|x_k + \gamma_k\, s(x_k, w_k) - x^*\|^2 = d_k + \gamma_k^2 \|s(x_k, w_k)\|^2 - 2\gamma_k (x^* - x_k)^\top s(x_k, w_k),$$

so, using the two conditions of Proposition 2,

$$\begin{aligned} E_k[d_{k+1}] &\leq d_k + \gamma_k^2 c_2 \left( 1 + \|x^* - x_k\|^2 \right) - 2\gamma_k c_1 \|x^* - x_k\|^2 \\ &= d_k + c_2 \gamma_k^2 (1 + d_k) - 2 c_1 \gamma_k d_k \\ &= \underbrace{d_k}_{X_k} + \underbrace{c_2 \gamma_k^2}_{Y_k} - \underbrace{\left( 2 c_1 \gamma_k - c_2 \gamma_k^2 \right) d_k}_{Z_k}, \end{aligned}$$

which has the form $E_k[X_{k+1}] \leq X_k + Y_k - Z_k$.

We now check that the variables we just defined satisfy the conditions of Theorem 1. First, $\sum_k Y_k = c_2 \sum_k \gamma_k^2 < \infty$ since $\sum_k \gamma_k^2 < \infty$. Also, $X_k$ and $Y_k$ as defined are nonnegative. However, $Z_k = (2 c_1 \gamma_k - c_2 \gamma_k^2) d_k$ could take on negative values. But since $\gamma_k$ converges to 0, there is some $K$ such that $Z_k \geq 0$ for all $k \geq K$, so we can apply the theorem starting at index $k = K$.

Now, from Theorem 1 we know that $X_k = d_k = \|x_k - x^*\|^2$ converges to a finite random variable with probability 1. We also know that $\sum_k Z_k = \sum_k (2 c_1 \gamma_k - c_2 \gamma_k^2) d_k < \infty$. Since $c_2 \sum_k \gamma_k^2 d_k < \infty$ (the sequence $d_k$ converges and is therefore bounded with probability 1, while $\sum_k \gamma_k^2 < \infty$), it follows that $2 c_1 \sum_k \gamma_k d_k < \infty$. However, since $c_1$ is a positive constant and $\sum_k \gamma_k = \infty$, the only way this sum can be finite is if $d_k$ converges to zero: if $d_k$ converged to any value other than zero, the sum would diverge. We conclude that $d_k = \|x_k - x^*\|^2 \to 0$ with probability 1, which implies that $x_k \to x^*$ with probability 1, as desired. □

Example 1. Consider $F : \mathbb{R}^n \to \mathbb{R}^n$ such that $\|F(x) - F(y)\| \leq \alpha \|x - y\|$ for some $\alpha \in (0, 1)$, let $x^*$ be its unique fixed point, and let the $w_k$ be i.i.d. with $E[w_k] = 0$ and $E[\|w_k\|^2] < \infty$. Consider the iteration

$$x_{k+1} = x_k + \gamma_k \underbrace{\left( F(x_k) - x_k + w_k \right)}_{s(x_k, w_k)}, \qquad \bar{s}(x) = F(x) - x.$$

We check the two conditions of Proposition 2.

1. Using $F(x^*) = x^*$ and the Cauchy-Schwarz inequality,

$$\begin{aligned} (x^* - x)^\top \bar{s}(x) &= (x^* - x)^\top \left( F(x) - x \right) = (x^* - x)^\top \left( F(x) - F(x^*) \right) + (x^* - x)^\top (x^* - x) \\ &\geq -\|x^* - x\| \, \|F(x) - F(x^*)\| + \|x^* - x\|^2 \\ &\geq -\alpha \|x^* - x\|^2 + \|x^* - x\|^2 = (1 - \alpha) \|x^* - x\|^2, \end{aligned}$$

so condition 1 holds with $c_1 = 1 - \alpha > 0$.

2. Using $\|F(x) - x\| \leq \|F(x) - F(x^*)\| + \|x^* - x\| \leq (1 + \alpha) \|x^* - x\|$,

$$\begin{aligned} E\left[ \|s(x, w_k)\|^2 \right] &= E\left[ \|F(x) - x + w_k\|^2 \right] \leq E\left[ \left( \|F(x) - x\| + \|w_k\| \right)^2 \right] \\ &\leq E\left[ \left( (1 + \alpha) \|x^* - x\| + \|w_k\| \right)^2 \right] \leq 2 (1 + \alpha)^2 \|x^* - x\|^2 + 2 E\left[ \|w_k\|^2 \right] \\ &\leq 2 \left( (1 + \alpha)^2 + E\left[ \|w_k\|^2 \right] \right) \left( 1 + \|x^* - x\|^2 \right) = C \left( 1 + \|x^* - x\|^2 \right), \end{aligned}$$

so condition 2 holds with $c_2 = C$.

Thus, by Proposition 2, $x_k \to x^*$.
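A quick numerical companion to Example 1 (again our own illustration, with an arbitrary affine contraction): take $F(x) = \alpha x + b$ on $\mathbb{R}$, which is a contraction with modulus $\alpha$, and run the iteration $x_{k+1} = x_k + \gamma_k (F(x_k) - x_k + w_k)$.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, b = 0.5, 1.0         # F(x) = alpha * x + b, a contraction with modulus alpha
x_star = b / (1.0 - alpha)  # unique fixed point F(x*) = x*, here 2.0

x = 10.0
for k in range(1, 300_000):
    gamma_k = 1.0 / k
    w = rng.normal()                          # E[w] = 0, E[w^2] < inf
    x += gamma_k * ((alpha * x + b) - x + w)  # s(x, w) = F(x) - x + w

print(x, x_star)  # x_k settles near the fixed point, as Proposition 2 predicts
```

Note that the noise never vanishes; it is the decaying step sizes that average it out, which is exactly the mechanism the proof formalizes through the supermartingale argument.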
One last problem to think about: in the last homework assignment we showed that asynchronous value iteration converges to the optimal value function. Now, what if instead of a maximum-norm contraction mapping we had a contraction with respect to the Euclidean norm? It can be shown that this does not guarantee convergence under asynchronous updates, although you do still get convergence in the synchronous case. The exercise is to come up with an example of an asynchronous process that does not reach the fixed point of a Euclidean-norm contraction.

Homework #2. Let $T : \mathbb{R}^n \to \mathbb{R}^n$ be such that there exists $\alpha \in (0, 1)$ with

$$\|T V - T \bar{V}\|_2 \leq \alpha \|V - \bar{V}\|_2 \qquad \text{for all } V, \bar{V}.$$

Consider the update

$$V_{k+1}(s) = \begin{cases} (T V_k)(s) & \text{if } s = s_k, \\ V_k(s) & \text{otherwise.} \end{cases}$$

Show that $V_k$ may not converge to the fixed point $V^*$ of $T$, even if each $s \in \mathcal{S}$ is selected infinitely often.
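For experimenting with Homework #2, here is a small harness (our own scaffolding, not a solution): it applies an affine map $T(V) = A V + b$ one coordinate at a time according to a given schedule. Whether the iterates converge depends on the matrix $A$, which is only a placeholder here, and on the schedule you choose.

```python
import numpy as np

def async_fixed_point(A, b, V0, schedule):
    """Asynchronous iteration for T(V) = A @ V + b: at each step, only the
    scheduled coordinate is replaced by the corresponding coordinate of
    T(V); all other coordinates are frozen, matching the homework's update."""
    V = np.array(V0, dtype=float)
    for s in schedule:
        V[s] = A[s] @ V + b[s]
    return V

# T is a Euclidean-norm contraction iff the spectral norm of A is below 1,
# which you can check with np.linalg.norm(A, 2). Try schedules that visit
# every coordinate infinitely often and compare the output against the
# synchronous fixed point V* = np.linalg.solve(np.eye(len(b)) - A, b).
```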