Stochastic Variance Reduction Methods for Policy Evaluation

Simon S. Du 1   Jianshu Chen 2   Lihong Li 2   Lin Xiao 2   Dengyong Zhou 2

Abstract

Policy evaluation is concerned with estimating the value function that predicts the long-term values of states under a given policy. It is a crucial step in many reinforcement-learning algorithms. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle-point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods, for solving the problem. These algorithms scale linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.

1 Introduction

Reinforcement learning (RL) is a powerful learning paradigm for sequential decision making (see, e.g., Bertsekas & Tsitsiklis, 1995; Sutton & Barto, 1998). An RL agent interacts with the environment by repeatedly observing the current state, taking an action according to a certain policy, receiving a reward signal, and transitioning to a next state. A policy specifies which action to take given the current state. Policy evaluation estimates a value function that predicts the expected cumulative reward the agent would receive by following a fixed policy starting at a certain state. In addition to quantifying the long-term values of states, which can be of interest in its own right, the value function also provides important information for the agent to optimize its policy. For example, policy-iteration algorithms iterate between policy-evaluation steps and policy-improvement steps until a (near-)optimal policy is found (Bertsekas & Tsitsiklis, 1995; Lagoudakis & Parr, 2003). Therefore, estimating the value function efficiently and accurately is essential in RL.

There has been substantial work on policy evaluation, with temporal-difference (TD) methods being perhaps the most popular. These methods use the Bellman equation to bootstrap the estimation process. Different cost functions are formulated to exploit this idea, leading to different policy evaluation algorithms; see Dann et al. (2014) for a comprehensive survey. In this paper, we study policy evaluation by minimizing the mean squared projected Bellman error (MSPBE) with linear approximation of the value function. We focus on the batch setting where a fixed, finite dataset is given. This fixed-data setting is not only important in its own right (Lange et al., 2011), but is also an important component of other RL methods such as experience replay (Lin, 1992). The finite-data regime makes it possible to solve policy evaluation more efficiently with recently developed fast optimization methods based on stochastic variance reduction, such as SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014).

1 Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA. 2 Microsoft Research, Redmond, Washington 98052, USA. Correspondence to: Simon S. Du <ssdu@cs.cmu.edu>, Jianshu Chen <jianshuc@microsoft.com>, Lihong Li <lihongli@microsoft.com>, Lin Xiao <lin.xiao@microsoft.com>, Dengyong Zhou <denzho@microsoft.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
For minimizing strongly convex functions with a finite-sum structure, such methods enjoy the same low computational cost per iteration as the classical stochastic gradient method, but also achieve fast, linear convergence rates (i.e., exponential decay of the optimality gap in the objective). However, they cannot be applied directly to minimize the MSPBE, whose objective does not have the finite-sum structure. In this paper, we overcome this obstacle by transforming the empirical MSPBE problem into an equivalent convex-concave saddle-point problem that possesses the desired finite-sum structure. In the saddle-point problem, we consider the model parameters as the primal variables, which are coupled with the dual variables through a bilinear term. Moreover, without an ℓ2-regularization on the model parameters, the objective is only strongly concave in the dual variables, but not in the primal variables.

We propose a primal-dual batch gradient method, as well as two stochastic variance-reduction methods based on SVRG and SAGA, respectively. Surprisingly, we show that when the coupling matrix is full rank, these algorithms achieve linear convergence in both the primal and dual spaces, despite the lack of strong convexity of the objective in the primal variables.

Our results also extend to off-policy learning and to TD with eligibility traces (Sutton & Barto, 1998; Precup et al., 2001).

We note that Balamurugan & Bach (2016) have extended both SVRG and SAGA to solve convex-concave saddle-point problems with linear-convergence guarantees. The main differences between our results and theirs are as follows. First, linear convergence in Balamurugan & Bach (2016) relies on the assumption that the objective is strongly convex in the primal variables and strongly concave in the dual variables. Our results show, somewhat surprisingly, that only one of the two is necessary if the primal-dual coupling is bilinear and the coupling matrix is full rank. In fact, we are not aware of similar previous results even for the primal-dual batch gradient method, which we establish in this paper. Second, even if a strongly convex regularization on the primal variables is added to the MSPBE objective, the algorithms in Balamurugan & Bach (2016) cannot be applied efficiently. Their algorithms require that the proximal mappings of the strongly convex and concave regularization functions be computed efficiently. In our saddle-point formulation, the strong concavity in the dual variables comes from a quadratic function defined by the feature covariance matrix, which cannot be inverted efficiently and makes the proximal mapping costly to compute. Instead, our algorithms only use its (stochastic) gradients and hence are much more efficient.

We compare various gradient-based algorithms on Random MDP and Mountain Car datasets. The experiments demonstrate the effectiveness of our proposed methods.

2 Preliminaries

We consider a Markov Decision Process (MDP) (Puterman, 2005) described by (S, A, P^a_{ss'}, R, γ), where S is the set of states, A the set of actions, P^a_{ss'} the probability of transitioning from state s to state s' after taking action a, R(s, a) the reward received after taking action a in state s, and γ ∈ [0, 1) a discount factor. The goal of an agent is to find an action-selection policy π such that the long-term reward under this policy is maximized. For ease of exposition, we assume S is finite, but none of our results relies on this assumption.

A key step in many RL algorithms is to estimate the value function of a given policy π, defined as V^π(s) := E[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, π ]. Let V^π denote the vector constructed by stacking the values V^π(1), ..., V^π(|S|) on top of each other. Then V^π is the unique fixed point of the Bellman operator T^π:

    V^π = T^π V^π := R^π + γ P^π V^π,                                   (1)

where R^π is the expected reward vector under policy π, defined elementwise as R^π(s) = E_{π(a|s)}[R(s, a)], and P^π is the transition matrix induced by applying policy π, defined entrywise as P^π(s, s') = E_{π(a|s)}[P^a_{ss'}].

2.1 Mean squared projected Bellman error (MSPBE)

One approach to scaling up when the state space size |S| is large or infinite is to use a linear approximation for V^π. Formally, we use a feature map φ: S → R^d and approximate the value function by V̂_θ(s) = φ(s)^T θ, where θ ∈ R^d is the model parameter to be estimated. We want to find the θ that minimizes the mean squared projected Bellman error, or MSPBE:

    MSPBE(θ) := (1/2) ‖V̂_θ − Π T^π V̂_θ‖²_Ξ,                            (2)

where Ξ is a diagonal matrix whose diagonal elements form the stationary distribution over S induced by the policy π, and Π is the Ξ-weighted projection matrix onto the linear space spanned by φ(1), ..., φ(|S|), that is,

    Π = Φ (Φ^T Ξ Φ)^{-1} Φ^T Ξ,                                          (3)

where Φ := [φ^T(1); ...; φ^T(|S|)] is the matrix obtained by stacking the feature vectors row by row.
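To make the definition concrete, here is a minimal sketch (ours, not from the paper) of evaluating MSPBE(θ) via (2)–(3) for a small MDP whose reward vector R^π, transition matrix P^π, stationary distribution and features are all known explicitly; the function and variable names are our own.

```python
import numpy as np

def mspbe(theta, Phi, R_pi, P_pi, xi, gamma):
    """MSPBE(theta) = 0.5 * || V_theta - Pi T^pi V_theta ||^2_Xi, per (2)-(3)."""
    Xi = np.diag(xi)                                   # diagonal stationary-distribution matrix
    V = Phi @ theta                                    # linear value estimate V_theta
    TV = R_pi + gamma * P_pi @ V                       # Bellman operator applied to V_theta, per (1)
    Proj = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # Xi-weighted projection (3)
    diff = V - Proj @ TV
    return 0.5 * diff @ Xi @ diff
```

This direct evaluation requires the full model and is only feasible for small, tabular problems; the rest of the paper works with sample-based estimates instead.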
Substituting (3) and (1) into (2), we obtain (see, e.g., Dann et al., 2014)

    MSPBE(θ) = (1/2) ‖Φ^T Ξ (V̂_θ − T^π V̂_θ)‖²_{(Φ^T Ξ Φ)^{-1}}.

We can further rewrite this expression for the MSPBE as a standard weighted least-squares problem:

    MSPBE(θ) = (1/2) ‖Aθ − b‖²_{C^{-1}},

with properly defined A, b and C, described as follows. Suppose the MDP under policy π settles at its stationary distribution and generates an infinite transition sequence {(s_t, a_t, r_t, s_{t+1})}_{t=1}^∞, where s_t is the current state, a_t is the action, r_t is the reward, and s_{t+1} is the next state. Then, with the definitions φ_t := φ(s_t) and φ'_t := φ(s_{t+1}), we have

    A = E[φ_t (φ_t − γ φ'_t)^T],   b = E[r_t φ_t],   C = E[φ_t φ_t^T],   (4)

where the expectations E[·] are with respect to the stationary distribution. Many TD solutions converge to a minimizer of the MSPBE in the limit (Tsitsiklis & Van Roy, 1997; Dann et al., 2014).

2.2 Empirical MSPBE

In practice, the quantities in (4) are often unknown, and we only have access to a finite dataset with n transitions D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^n. By replacing the unknown statistics with their finite-sample estimates, we obtain the empirical MSPBE, or EM-MSPBE. Specifically, let

    Â := (1/n) Σ_{t=1}^n A_t,   b̂ := (1/n) Σ_{t=1}^n b_t,   Ĉ := (1/n) Σ_{t=1}^n C_t,   (5)

where, for t = 1, ..., n,

    A_t := φ_t (φ_t − γ φ'_t)^T,   b_t := r_t φ_t,   C_t := φ_t φ_t^T.   (6)
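A minimal sketch (ours, not the authors' code) of forming the finite-sample estimates in (5)–(6) from a batch of n transitions; it assumes `phi` and `phi_next` are (n, d) arrays of feature vectors and `r` is an (n,) array of rewards.

```python
import numpy as np

def empirical_statistics(phi, phi_next, r, gamma):
    """Return (A_hat, b_hat, C_hat) as defined in (5)-(6)."""
    n, d = phi.shape
    delta = phi - gamma * phi_next        # rows are (phi_t - gamma * phi'_t)^T
    A_hat = phi.T @ delta / n             # (1/n) sum_t phi_t (phi_t - gamma phi'_t)^T
    b_hat = phi.T @ r / n                 # (1/n) sum_t r_t phi_t
    C_hat = phi.T @ phi / n               # (1/n) sum_t phi_t phi_t^T
    return A_hat, b_hat, C_hat
```

Forming these d x d matrices explicitly costs O(nd²), which is exactly the pre-computation cost discussed below for LSTD and for one of the batch-gradient variants.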

The EM-MSPBE with an optional ℓ2-regularization is given by

    EM-MSPBE(θ) = (1/2) ‖Âθ − b̂‖²_{Ĉ^{-1}} + (ρ/2) ‖θ‖²,               (7)

where ρ ≥ 0 is a regularization factor. Observe that (7) is a (regularized) weighted least-squares problem. Assuming Ĉ is invertible, its optimal solution is

    θ* = (Â^T Ĉ^{-1} Â + ρ I)^{-1} Â^T Ĉ^{-1} b̂.                       (8)

Computing θ* directly requires O(nd²) operations to form the matrices Â, b̂ and Ĉ, and then O(d³) operations to complete the calculation. This method, known as least-squares temporal difference or LSTD (Bradtke & Barto, 1996; Boyan, 2002), can be very expensive when n and d are large. One can also skip forming the matrices explicitly and compute θ* using n recursive rank-one updates (Nedić & Bertsekas, 2003). Since each rank-one update costs O(d²), the total cost is still O(nd²).

In the sequel, we develop efficient algorithms to minimize the EM-MSPBE using stochastic variance reduction methods, which sample one (φ_t, φ'_t) per update without pre-computing Â, b̂ and Ĉ. These algorithms not only maintain a low O(d) per-iteration computation cost, but also attain fast linear convergence rates with a log(1/ε) dependence on the desired accuracy ε.

3 Saddle-Point Formulation of the EM-MSPBE

Our algorithms (in Section 5) are based on the stochastic variance reduction techniques developed for minimizing a finite sum of convex functions, more specifically, SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014). They deal with problems of the form

    min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),                        (9)

where each f_i is convex. We immediately notice that the EM-MSPBE in (7) cannot be put into this form, even though the matrices Â, b̂ and Ĉ have the finite-sum structure given in (5). Thus, extending variance reduction techniques to EM-MSPBE minimization is not straightforward. Nevertheless, we will show that minimizing the EM-MSPBE is equivalent to solving a convex-concave saddle-point problem which does possess the desired finite-sum structure.

To proceed, we resort to the machinery of conjugate functions (e.g., Rockafellar, 1970, Section 12). For a function f: R^d → R, its conjugate f*: R^d → R is defined as f*(y) := sup_x (y^T x − f(x)). Note that the conjugate of (1/2)‖x‖²_Ĉ is (1/2)‖y‖²_{Ĉ^{-1}}, i.e.,

    (1/2)‖y‖²_{Ĉ^{-1}} = max_x ( y^T x − (1/2)‖x‖²_Ĉ ).

With this relation, we can rewrite the EM-MSPBE in (7) as

    max_w ( w^T (b̂ − Âθ) − (1/2)‖w‖²_Ĉ ) + (ρ/2)‖θ‖²,

so that minimizing the EM-MSPBE is equivalent to solving

    min_{θ ∈ R^d} max_{w ∈ R^d} L(θ, w) = (1/n) Σ_{t=1}^n L_t(θ, w),     (10)

where the Lagrangian, defined as

    L(θ, w) := (ρ/2)‖θ‖² − w^T Â θ − (1/2)‖w‖²_Ĉ + w^T b̂,              (11)

may be decomposed using (5), with

    L_t(θ, w) := (ρ/2)‖θ‖² − w^T A_t θ − (1/2)‖w‖²_{C_t} + w^T b_t.

Therefore, minimizing the EM-MSPBE is equivalent to solving the saddle-point problem (10), which is convex in the primal variable θ and concave in the dual variable w. Moreover, it has a finite-sum structure similar to (9).
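To make the equivalence concrete, here is a small numerical sketch (ours, not the authors' code) that computes θ* from (8) and checks that the EM-MSPBE in (7) agrees with max_w L(θ, w); the inner concave problem in w is maximized in closed form at w(θ) = Ĉ^{-1}(b̂ − Âθ).

```python
import numpy as np

def em_mspbe(theta, A_hat, b_hat, C_hat, rho):
    """EM-MSPBE(theta) as in (7)."""
    resid = A_hat @ theta - b_hat
    return 0.5 * resid @ np.linalg.solve(C_hat, resid) + 0.5 * rho * theta @ theta

def lagrangian(theta, w, A_hat, b_hat, C_hat, rho):
    """L(theta, w) as in (11)."""
    return (0.5 * rho * theta @ theta - w @ (A_hat @ theta)
            - 0.5 * w @ (C_hat @ w) + w @ b_hat)

def theta_star(A_hat, b_hat, C_hat, rho):
    """Closed-form minimizer (8), i.e. the (regularized) LSTD solution."""
    d = A_hat.shape[1]
    CiA = np.linalg.solve(C_hat, A_hat)
    return np.linalg.solve(A_hat.T @ CiA + rho * np.eye(d),
                           A_hat.T @ np.linalg.solve(C_hat, b_hat))

def check_equivalence(theta, A_hat, b_hat, C_hat, rho):
    """Verify EM-MSPBE(theta) == max_w L(theta, w) at an arbitrary theta."""
    w_best = np.linalg.solve(C_hat, b_hat - A_hat @ theta)   # maximizer of the concave inner problem
    return np.isclose(em_mspbe(theta, A_hat, b_hat, C_hat, rho),
                      lagrangian(theta, w_best, A_hat, b_hat, C_hat, rho))
```

The check is simply the conjugate-function identity applied with x = b̂ − Âθ; it holds for any θ, not just the minimizer.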
Liu et al. (2015) and Valcarcel Macua et al. (2015) independently showed that the GTD2 algorithm (Sutton et al., 2009b) is in fact a stochastic gradient method for solving the saddle-point problem (10), although they obtained the saddle-point formulation via different derivations. More recently, Dai et al. (2016) used the conjugate-function approach to obtain saddle-point formulations for a more general class of problems and derived primal-dual stochastic gradient algorithms for solving them. However, these algorithms have sublinear convergence rates, which leaves much room for improvement when applied to problems with finite datasets. Recently, Lian et al. (2017) developed methods for general finite-sum composition optimization that achieve a linear convergence rate. Different from our methods, their stochastic gradients are biased, and they have a worse dependence on the condition numbers (κ³ and κ⁴).

The fast linear convergence of our algorithms presented in Sections 4 and 5 requires the following assumption:

Assumption 1. Â has full rank, Ĉ is strictly positive definite, and the feature vectors φ_t are uniformly bounded.

Under mild regularity conditions (e.g., Wasserman, 2013, Chapter 5), Â and Ĉ converge in probability to A and C defined in (4), respectively. Thus, if the true statistics A is non-singular and C is positive definite, and we have enough training samples, these assumptions are usually satisfied. They have been widely used in previous works on gradient-based algorithms (e.g., Sutton et al., 2009a;b).
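As a small diagnostic sketch (ours, not part of the paper's algorithms), Assumption 1 can be checked numerically on the empirical statistics:

```python
import numpy as np

def check_assumption(A_hat, C_hat, tol=1e-10):
    """Assumption 1: A_hat has full rank and C_hat is strictly positive definite."""
    full_rank = np.linalg.matrix_rank(A_hat) == A_hat.shape[1]
    min_eig_C = np.linalg.eigvalsh(C_hat).min()     # C_hat is symmetric
    return full_rank and min_eig_C > tol
```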

A direct consequence of Assumption 1 is that θ* in (8) is the unique minimizer of the EM-MSPBE in (7), even without any strongly convex regularization on θ (i.e., even if ρ = 0). However, if ρ = 0, then the Lagrangian L(θ, w) is only strongly concave in w, but not strongly convex in θ. In this case, we will show that non-singularity of the coupling matrix Â passes an implicit strong convexity on to θ, which our algorithms exploit to obtain linear convergence in both the primal and dual spaces.

4 A Primal-Dual Batch Gradient Method

Before diving into the stochastic variance reduction algorithms, we first present Algorithm 1, a primal-dual batch gradient (PDBG) algorithm for solving the saddle-point problem (10). In Step 2, the vector B(θ, w) is obtained by stacking the primal gradient and the negative dual gradient:

    B(θ, w) := [ ∇_θ L(θ, w) ; −∇_w L(θ, w) ] = [ ρθ − Â^T w ; Âθ + Ĉw − b̂ ].   (12)

Algorithm 1  PDBG for Policy Evaluation
Inputs: initial point (θ, w), step sizes σ_θ and σ_w, and number of epochs M.
1: for i = 1 to M do
2:   (θ, w) ← (θ, w) − diag(σ_θ I, σ_w I) B(θ, w), where B(θ, w) is computed according to (12).
3: end for

Some notation is needed in order to characterize the convergence rate of Algorithm 1. For any symmetric positive definite matrix S, let λ_max(S) and λ_min(S) denote its maximum and minimum eigenvalues, respectively, and define its condition number κ(S) := λ_max(S)/λ_min(S). We also define L_ρ and μ_ρ for any ρ ≥ 0:

    L_ρ := λ_max(ρI + Â^T Ĉ^{-1} Â),                                    (13)
    μ_ρ := λ_min(ρI + Â^T Ĉ^{-1} Â).                                    (14)

By Assumption 1, we have L_ρ ≥ μ_ρ > 0. The following theorem is proved in Appendix B.

Theorem 1. Suppose Assumption 1 holds and let (θ*, w*) be the (unique) solution of (10). If the step sizes are chosen as σ_θ = 1/(9 L_ρ κ(Ĉ)) and σ_w = 8/(9 λ_max(Ĉ)), then the number of iterations of Algorithm 1 needed to achieve ‖θ − θ*‖² + ‖w − w*‖² ≤ ε is upper bounded by

    O( κ(ρI + Â^T Ĉ^{-1} Â) · κ(Ĉ) · log(1/ε) ).                        (15)

We assigned specific values to the step sizes σ_θ and σ_w for clarity. In general, we can use similar step sizes while keeping their ratio roughly constant at σ_w/σ_θ ≈ 8 L_ρ/λ_min(Ĉ); see Appendices A and B for more details. In practice, one can use a parameter search on a small subset of data to find reasonable step sizes. How to automatically select and adjust step sizes remains an interesting open problem.

Note that the linear rate is determined by two parts: (i) the strongly convex regularization parameter ρ, and (ii) the positive definiteness of Â^T Ĉ^{-1} Â. The second part can be interpreted as transferring the strong concavity in the dual variables through the full-rank bilinear coupling matrix Â. For this reason, even if the saddle-point problem (10) has only strong concavity in the dual variables (when ρ = 0), the algorithm still enjoys a linear convergence rate.

Moreover, even if ρ > 0, it would be inefficient to solve problem (10) using primal-dual algorithms based on proximal mappings of the strongly convex and concave terms (e.g., Chambolle & Pock, 2011; Balamurugan & Bach, 2016). The reason is that, in (10), the strong concavity of the Lagrangian with respect to the dual variable lies in the quadratic function (1/2)‖w‖²_Ĉ, whose proximal mapping cannot be computed efficiently. In contrast, the PDBG algorithm only needs its gradients.

If we pre-compute and store Â, b̂ and Ĉ, which costs O(nd²) operations, then computing the gradient operator B(θ, w) in (12) during each iteration of PDBG costs O(d²) operations. Alternatively, if we do not want to store these d × d matrices (especially if d is large), we can compute B(θ, w) as a finite sum on the fly. More specifically, B(θ, w) = (1/n) Σ_{t=1}^n B_t(θ, w), where, for each t = 1, ..., n,

    B_t(θ, w) = [ ρθ − A_t^T w ; A_t θ + C_t w − b_t ].                  (16)
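As a concrete illustration (ours, following the reconstruction of (12) and (16) above, not the authors' code), the per-sample gradient B_t and the on-the-fly variant of Algorithm 1 that averages B_t over the dataset can be sketched as follows; as explained next, the rank-one structure of A_t, b_t, C_t is what makes each B_t evaluation cost only O(d).

```python
import numpy as np

def B_t(theta, w, phi_t, phi_next_t, r_t, gamma, rho):
    """Per-sample gradient B_t(theta, w) from (16), computed in O(d)."""
    delta_t = phi_t - gamma * phi_next_t        # phi_t - gamma * phi'_t
    a = phi_t @ w                               # scalar phi_t^T w
    c = delta_t @ theta                         # scalar (phi_t - gamma phi'_t)^T theta
    g_theta = rho * theta - delta_t * a         # rho*theta - A_t^T w
    g_w_neg = phi_t * (c + a - r_t)             # A_t theta + C_t w - b_t  (negative dual gradient)
    return g_theta, g_w_neg

def pdbg(phi, phi_next, r, gamma, rho, sigma_theta, sigma_w, num_iters):
    """Algorithm 1 (PDBG), computing B(theta, w) as a finite sum on the fly (O(nd) per iteration)."""
    n, d = phi.shape
    theta, w = np.zeros(d), np.zeros(d)
    for _ in range(num_iters):
        g_theta, g_w_neg = np.zeros(d), np.zeros(d)
        for t in range(n):
            gt, gw = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
            g_theta += gt / n
            g_w_neg += gw / n
        theta = theta - sigma_theta * g_theta   # primal descent
        w = w - sigma_w * g_w_neg               # dual ascent (step along -(-grad_w L))
    return theta, w
```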
Since A_t, b_t and C_t are all rank-one matrices, as given in (6), computing each B_t(θ, w) requires only O(d) operations. Therefore, computing B(θ, w) costs O(nd) operations, as it averages B_t(θ, w) over the n samples.

5 Stochastic Variance Reduction Methods

If we replace B(θ, w) in line 2 of Algorithm 1 by the stochastic gradient B_t(θ, w) in (16), we recover the GTD2 algorithm of Sutton et al. (2009b), applied to a fixed dataset, possibly with multiple passes. It has a low per-iteration cost but a slow, sublinear convergence rate. In this section, we present two stochastic variance reduction methods and show that they achieve fast linear convergence.

5.1 SVRG for policy evaluation

Algorithm 2 is adapted from the stochastic variance reduced gradient (SVRG) method (Johnson & Zhang, 2013). It uses two layers of loops and maintains two sets of parameters, (θ̃, w̃) and (θ, w). In the outer loop, the algorithm computes a full gradient B(θ̃, w̃) at (θ̃, w̃), which takes O(nd) operations.

Algorithm 2  SVRG for Policy Evaluation
Inputs: initial point (θ, w), step sizes (σ_θ, σ_w), number of outer iterations M, and number of inner iterations N.
1: for m = 1 to M do
2:   Initialize (θ̃, w̃) = (θ, w) and compute B(θ̃, w̃).
3:   for j = 1 to N do
4:     Sample an index t_j from {1, ..., n}.
5:     Compute B_{t_j}(θ, w) and B_{t_j}(θ̃, w̃).
6:     (θ, w) ← (θ, w) − diag(σ_θ I, σ_w I) B̃_{t_j}(θ, w, θ̃, w̃), where B̃_{t_j}(θ, w, θ̃, w̃) is given in (17).
7:   end for
8: end for

Algorithm 3  SAGA for Policy Evaluation
Inputs: initial point (θ, w), step sizes σ_θ and σ_w, and number of iterations M.
1: Compute g_t = B_t(θ, w) for t = 1, ..., n.
2: Compute B = B(θ, w) = (1/n) Σ_{t=1}^n g_t.
3: for m = 1 to M do
4:   Sample an index t_m from {1, ..., n}.
5:   Compute h_{t_m} = B_{t_m}(θ, w).
6:   (θ, w) ← (θ, w) − diag(σ_θ I, σ_w I) (B + h_{t_m} − g_{t_m}).
7:   B ← B + (1/n)(h_{t_m} − g_{t_m}).
8:   g_{t_m} ← h_{t_m}.
9: end for

Afterwards, the algorithm executes the inner loop, which randomly samples an index t_j and updates (θ, w) using the variance-reduced stochastic gradient

    B̃_{t_j}(θ, w, θ̃, w̃) = B_{t_j}(θ, w) + B(θ̃, w̃) − B_{t_j}(θ̃, w̃).   (17)

Here, B_{t_j}(θ, w) is the stochastic gradient at (θ, w) computed from the random sample with index t_j, and B(θ̃, w̃) − B_{t_j}(θ̃, w̃) is a correction term that reduces the variance of B_{t_j}(θ, w) while keeping B̃_{t_j}(θ, w, θ̃, w̃) an unbiased estimate of B(θ, w). Since B(θ̃, w̃) is computed once during each iteration of the outer loop with cost O(nd) (as explained at the end of Section 4), and each of the N iterations of the inner loop costs O(d) operations, the total computational cost of SVRG for each outer loop is O(nd + Nd). We present the overall complexity analysis of Algorithm 2 in Section 5.3.
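A minimal sketch of Algorithm 2 (ours, not the authors' implementation), reusing the `B_t` helper sketched in Section 4; the snapshot full gradient is recomputed once per outer iteration and the inner updates use the variance-reduced direction (17).

```python
import numpy as np

def svrg_pe(phi, phi_next, r, gamma, rho, sigma_theta, sigma_w, M, N, rng=None):
    """SVRG for policy evaluation (Algorithm 2), using B_t from the Section 4 sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = phi.shape
    theta, w = np.zeros(d), np.zeros(d)
    for _ in range(M):                                  # outer loop
        theta_s, w_s = theta.copy(), w.copy()           # snapshot (theta~, w~)
        full_t, full_w = np.zeros(d), np.zeros(d)
        for t in range(n):                              # full gradient at the snapshot, O(nd)
            gt, gw = B_t(theta_s, w_s, phi[t], phi_next[t], r[t], gamma, rho)
            full_t += gt / n
            full_w += gw / n
        for _ in range(N):                              # inner loop, O(d) per step
            t = rng.integers(n)
            gt, gw = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
            st, sw = B_t(theta_s, w_s, phi[t], phi_next[t], r[t], gamma, rho)
            theta -= sigma_theta * (gt - st + full_t)   # variance-reduced primal step, per (17)
            w -= sigma_w * (gw - sw + full_w)           # variance-reduced dual step
    return theta, w
```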
5.2 SAGA for policy evaluation

The second stochastic variance reduction method for policy evaluation is adapted from SAGA (Defazio et al., 2014); see Algorithm 3. It uses a single loop and maintains a single set of parameters (θ, w). Algorithm 3 starts by computing each component gradient g_t = B_t(θ, w) at the initial point, and also forms their average B = (1/n) Σ_t g_t. At each iteration, the algorithm randomly picks an index t_m ∈ {1, ..., n} and computes the stochastic gradient h_{t_m} = B_{t_m}(θ, w). Then it updates (θ, w) using the variance-reduced stochastic gradient B + h_{t_m} − g_{t_m}, where g_{t_m} is the previously computed stochastic gradient for the t_m-th sample (associated with certain past values of θ and w). Afterwards, it updates the batch gradient estimate B to B + (1/n)(h_{t_m} − g_{t_m}) and replaces g_{t_m} with h_{t_m}.

As Algorithm 3 proceeds, different vectors g_t are computed using different values of θ and w (depending on when the index t was last sampled). So in general we need to store all vectors g_t, for t = 1, ..., n, to facilitate individual updates, which would cost an additional O(nd) storage. However, by exploiting the rank-one structure in (6), we only need to store three scalars per sample, namely (φ_t − γφ'_t)^T θ, (φ_t − γφ'_t)^T w, and φ_t^T w, and form g_{t_m} on the fly with O(d) computation. Overall, each iteration of SAGA costs O(d) operations.
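Before turning to the analysis, here is a matching sketch of Algorithm 3 (again ours, reusing the `B_t` helper); for clarity it stores the full table of per-sample gradients rather than the three scalars per sample mentioned above.

```python
import numpy as np

def saga_pe(phi, phi_next, r, gamma, rho, sigma_theta, sigma_w, M, rng=None):
    """SAGA for policy evaluation (Algorithm 3), using B_t from the Section 4 sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = phi.shape
    theta, w = np.zeros(d), np.zeros(d)
    g = np.zeros((n, 2, d))                              # stored per-sample gradients g_t
    for t in range(n):
        g[t, 0], g[t, 1] = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
    B_avg = g.mean(axis=0)                               # running average B
    for _ in range(M):
        t = rng.integers(n)
        ht, hw = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
        theta -= sigma_theta * (B_avg[0] + ht - g[t, 0])  # variance-reduced primal step
        w -= sigma_w * (B_avg[1] + hw - g[t, 1])          # variance-reduced dual step
        B_avg[0] += (ht - g[t, 0]) / n                    # update the running average (step 7)
        B_avg[1] += (hw - g[t, 1]) / n
        g[t, 0], g[t, 1] = ht, hw                         # replace the stored gradient (step 8)
    return theta, w
```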

5.3 Theoretical analyses of SVRG and SAGA

To study the convergence properties of SVRG and SAGA for policy evaluation, we introduce a smoothness parameter L_G based on the stochastic gradients B_t(θ, w). Let β := σ_w/σ_θ be the ratio of the dual to the primal step size, and define a pair of weighted Euclidean norms

    Ω(θ, w) := (‖θ‖² + β^{-1}‖w‖²)^{1/2},   Ω'(θ, w) := (‖θ‖² + β‖w‖²)^{1/2}.

Note that Ω(θ − θ*, w − w*) upper bounds the error in optimizing θ: Ω(θ − θ*, w − w*) ≥ ‖θ − θ*‖. Therefore, any bound on Ω(θ − θ*, w − w*) applies automatically to ‖θ − θ*‖. Next, we define the parameter L_G through its square:

    L_G² := (1/n) Σ_{t=1}^n  sup_{(θ_1, w_1) ≠ (θ_2, w_2)}  Ω'(B_t(θ_1, w_1) − B_t(θ_2, w_2))² / Ω(θ_1 − θ_2, w_1 − w_2)².

This definition is similar to the smoothness constant used in Balamurugan & Bach (2016), except that we use the step-size ratio β, rather than the strong convexity and concavity parameters of the Lagrangian, to define Ω and Ω'. (Since our saddle-point problem is not necessarily strongly convex in θ when ρ = 0, we could not define them in the same way as Balamurugan & Bach, 2016.) Substituting the definition of B_t(θ, w) in (16), we have

    L_G² = ‖ (1/n) Σ_{t=1}^n G_t^T G_t ‖,   where   G_t := [ ρI   −√β A_t^T ;  √β A_t   β C_t ].   (18)

With the above definitions, we characterize the convergence of Ω(θ_m − θ*, w_m − w*), where (θ*, w*) is the solution of (10) and (θ_m, w_m) is the output of the algorithms after the m-th iteration; for SVRG, this is the m-th outer iteration of Algorithm 2. The following two theorems are proved in Appendices C and D, respectively.

Theorem 2 (Convergence rate of SVRG). Suppose Assumption 1 holds. If we choose

    σ_θ = μ_ρ / (48 κ(Ĉ) L_G²),   σ_w = (8 L_ρ / λ_min(Ĉ)) σ_θ,   N = 512 κ(Ĉ) L_G² / μ_ρ²,

where L_ρ and μ_ρ are defined in (13) and (14), then

    E[ Ω(θ_m − θ*, w_m − w*)² ] ≤ (4/5)^m Ω(θ_0 − θ*, w_0 − w*)².

The overall computational cost for reaching E[Ω(θ_m − θ*, w_m − w*)] ≤ ε is upper bounded by

    O( (n + κ(Ĉ) L_G² / λ²_min(ρI + Â^T Ĉ^{-1} Â)) d log(1/ε) ).          (19)

Theorem 3 (Convergence rate of SAGA). Suppose Assumption 1 holds. If we choose σ_θ = μ_ρ / (3(8 κ²(Ĉ) L_G² + n μ_ρ²)) and σ_w = (8 L_ρ / λ_min(Ĉ)) σ_θ in Algorithm 3, then

    E[ Ω(θ_m − θ*, w_m − w*)² ] ≤ 2 (1 − τ)^m Ω(θ_0 − θ*, w_0 − w*)²,   where   τ = μ_ρ² / (9(8 κ²(Ĉ) L_G² + n μ_ρ²)).

The total cost to achieve E[Ω(θ_m − θ*, w_m − w*)] ≤ ε has the same bound as in (19).

Similar to our result in (15), both the SVRG and SAGA algorithms for policy evaluation enjoy linear convergence even if there is no strong convexity in the saddle-point problem (10) (i.e., when ρ = 0). This is mainly due to the positive definiteness of Â^T Ĉ^{-1} Â when Ĉ is positive definite and Â is full rank. In contrast, the linear convergence of SVRG and SAGA in Balamurugan & Bach (2016) requires the Lagrangian to be both strongly convex in θ and strongly concave in w. Moreover, in the policy evaluation problem, the strong concavity with respect to the dual variable comes from the weighted quadratic norm (1/2)‖w‖²_Ĉ, which does not admit an efficient proximal mapping as required by the proximal versions of SVRG and SAGA in Balamurugan & Bach (2016). Our algorithms only require stochastic gradients of this function, which are easy to compute thanks to its finite-sum structure.

Balamurugan & Bach (2016) also proposed accelerated variants of SVRG and SAGA using the catalyst framework of Lin et al. (2015). Similar extensions can be made for the three algorithms presented in this paper; we omit the details due to space limits.

6 Comparison of Different Algorithms

This section compares the computational complexities of several representative policy-evaluation algorithms that minimize the EM-MSPBE, as summarized in Table 1.

Table 1. Complexity of different policy evaluation algorithms. In the table, d is the feature dimension, n is the dataset size, κ := κ(ρI + Â^T Ĉ^{-1} Â), κ_G := L_G / λ_min(ρI + Â^T Ĉ^{-1} Â), and κ_GTD2 is a condition number associated with GTD2.
    SVRG / SAGA:   O( (n + κ(Ĉ) κ_G²) d log(1/ε) )
    GTD2:          O( d κ_GTD2 / ε )
    PDBG-(I):      O( n d κ κ(Ĉ) log(1/ε) )
    PDBG-(II):     O( n d² + d² κ κ(Ĉ) log(1/ε) )
    LSTD:          O( n d² )  or  O( n d² + d³ )

The upper part of the table lists algorithms whose complexity is linear in the feature dimension d, including the two new algorithms presented in the previous section. We can also apply GTD2 to a finite dataset, with samples drawn uniformly at random with replacement. It costs O(d) per iteration, but has a sublinear convergence rate in ε. In practice, one may choose ε = Θ(1/n) for generalization reasons (see, e.g., Lazaric et al., 2010), leading to an O(κ_GTD2 · nd) overall complexity for GTD2, where κ_GTD2 is a condition number associated with the algorithm. However, as verified by our experiments, the bounds in the table show that our SVRG/SAGA-based algorithms are much faster, as their effective condition numbers vanish when n becomes large. TDC has a complexity similar to GTD2's.

In the table, we list two different implementations of PDBG. PDBG-(I) computes the gradient by averaging the stochastic gradients over the entire dataset at each iteration, which costs O(nd) operations; see the discussion at the end of Section 4. PDBG-(II) first pre-computes the matrices Â, b̂ and Ĉ using O(nd²) operations, and then computes the batch gradient at each iteration with O(d²) operations.
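The quantities κ(Ĉ), κ, L_ρ and μ_ρ that drive these bounds can be computed directly from the empirical statistics; a minimal sketch (ours, computing κ_G would additionally require L_G from (18) and is omitted):

```python
import numpy as np

def problem_constants(A_hat, C_hat, rho=0.0):
    """Return (L_rho, mu_rho, kappa, kappa_C) per (13)-(14) and the Table 1 caption."""
    d = A_hat.shape[1]
    M = rho * np.eye(d) + A_hat.T @ np.linalg.solve(C_hat, A_hat)   # rho*I + A^T C^{-1} A
    eig_M = np.linalg.eigvalsh((M + M.T) / 2)      # symmetrize against round-off
    eig_C = np.linalg.eigvalsh(C_hat)
    L_rho, mu_rho = eig_M.max(), eig_M.min()       # (13) and (14)
    kappa = L_rho / mu_rho                         # kappa(rho*I + A^T C^{-1} A)
    kappa_C = eig_C.max() / eig_C.min()            # kappa(C_hat)
    return L_rho, mu_rho, kappa, kappa_C
```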
If d is very large (e.g., when d > n), PDBG-(I) has an advantage over PDBG-(II). The lower part of the table includes LSTD, which has O(nd²) complexity if rank-one updates are used. SVRG and SAGA are more efficient than the other algorithms when either d or n is very large. In particular, they have a lower complexity than LSTD when

    d > (1 + κ(Ĉ) κ_G²/n) log(1/ε),

a condition that is easy to satisfy when n is very large. On the other hand, the SVRG and SAGA algorithms are more efficient than PDBG-(I) if n is large, say n > κ(Ĉ) κ_G² / (κ κ(Ĉ)), where κ and κ_G are defined in the caption of Table 1.

There are other algorithms whose complexity scales linearly with n and d, including iLSTD (Geramifard et al., 2007), GTD2 and TDC (Sutton et al., 2009b), fLSTD-SA (Prashanth et al., 2014), and the more recent algorithms of Wang et al. (2016) and Dai et al. (2016). However, their convergence is slow: the number of iterations required to reach a desired accuracy ε grows as 1/ε or worse.

The CTD algorithm (Korda & Prashanth, 2015) uses a similar idea as SVRG to reduce the variance in TD updates. That algorithm is shown to have a similar linear convergence rate, but in an online setting where the data stream is generated by a Markov process with finite states and exponential mixing. The method solves for a fixed-point solution by stochastic approximation; as a result, it can be non-convergent in off-policy learning, while our algorithms remain stable (cf. Section 7.1).

7 Extensions

It is possible to extend our approach to accelerate the optimization of other objectives such as MSBE and NEU (Dann et al., 2014). In this section, we briefly describe two extensions of the algorithms developed earlier.

7.1 Off-policy learning

In some cases, we may want to estimate the value function of a policy π from a set of data D generated by a different behavior policy π_b. This is called off-policy learning (Sutton & Barto, 1998, Chapter 8). In the off-policy case, samples are generated from the distribution induced by the behavior policy π_b, not the target policy π. While such a mismatch often causes stochastic-approximation-based methods to diverge (Tsitsiklis & Van Roy, 1997), our gradient-based algorithms remain convergent with the same (fast) convergence rate.

Consider the RL framework outlined in Section 2. For each state-action pair (s_t, a_t) such that π_b(a_t|s_t) > 0, we define the importance ratio ρ_t := π(a_t|s_t)/π_b(a_t|s_t). The EM-MSPBE for off-policy learning has the same expression as in (7), except that A_t, b_t and C_t are modified by the weight factor ρ_t, as listed in Table 2; see also Liu et al. (2015, Eqn. 6) for a related discussion. Algorithms 1-3 remain the same for the off-policy case after A_t, b_t and C_t are modified correspondingly.

7.2 Learning with eligibility traces

Eligibility traces are a useful technique for trading off bias and variance in TD learning (Singh & Sutton, 1996; Kearns & Singh, 2000). When they are used, we can pre-compute z_t in Table 2 before running our new algorithms. Note that the EM-MSPBE with eligibility traces has the same form as (7), with A_t, b_t and C_t defined according to the last row of Table 2. At the m-th step of the learning process, the algorithm randomly samples z_{t_m}, φ_{t_m}, φ'_{t_m} and r_{t_m} from the fixed dataset and computes the corresponding stochastic gradients, where the index t_m is uniformly distributed over {1, ..., n} and independent across different values of m. Algorithms 1-3 immediately work for this case, enjoying a similar linear convergence rate and a computation complexity linear in n and d. We need an additional O(nd) operations to pre-compute the z_t recursively and an additional O(nd) storage for them; however, this does not change the order of the total complexity for SVRG/SAGA.

Table 2. Expressions of A_t, b_t and C_t for different cases of policy evaluation. Here ρ_t := π(a_t|s_t)/π_b(a_t|s_t), and z_t := Σ_{i=1}^t (γλ)^{t-i} φ_i, where λ is a given parameter.
    On-policy:          A_t = φ_t (φ_t − γφ'_t)^T,      b_t = r_t φ_t,       C_t = φ_t φ_t^T
    Off-policy:         A_t = ρ_t φ_t (φ_t − γφ'_t)^T,  b_t = ρ_t r_t φ_t,   C_t = φ_t φ_t^T
    Eligibility trace:  A_t = z_t (φ_t − γφ'_t)^T,      b_t = r_t z_t,       C_t = φ_t φ_t^T
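A minimal sketch (ours, not from the paper) of the per-sample quantities in Table 2 for these two extensions; the symbol names `rho_t` and `lam` mirror the notation reconstructed above and should be treated as assumptions where the original symbols were garbled.

```python
import numpy as np

def off_policy_terms(phi_t, phi_next_t, r_t, rho_t, gamma):
    """Off-policy row of Table 2: A_t, b_t are reweighted by the importance ratio rho_t."""
    delta_t = phi_t - gamma * phi_next_t
    A_t = rho_t * np.outer(phi_t, delta_t)      # rho_t * phi_t (phi_t - gamma phi'_t)^T
    b_t = rho_t * r_t * phi_t
    C_t = np.outer(phi_t, phi_t)                # unchanged by the importance weight
    return A_t, b_t, C_t

def eligibility_traces(phi, gamma, lam):
    """Pre-compute z_t = sum_{i<=t} (gamma*lam)^(t-i) phi_i along one trajectory, in O(nd)."""
    z = np.zeros_like(phi, dtype=float)
    z[0] = phi[0]
    for t in range(1, len(phi)):
        z[t] = gamma * lam * z[t - 1] + phi[t]  # recursive trace update
    return z
```

For the eligibility-trace row of Table 2, the pre-computed z_t simply replaces φ_t in A_t and b_t while C_t keeps the plain features, so the same O(d) per-sample gradient computation carries over.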
8 Experiments

In this section, we compare the following algorithms on two benchmark problems: (i) PDBG (Algorithm 1); (ii) GTD2, with samples drawn randomly with replacement from the dataset; (iii) TD: the fLSTD-SA algorithm of Prashanth et al. (2014); (iv) SVRG (Algorithm 2); and (v) SAGA (Algorithm 3). Note that when ρ > 0, the TD solution and the EM-MSPBE minimizer differ, so we do not include TD in that case.

For step size tuning, σ_θ is chosen from {10⁰, 10⁻¹, ..., 10⁻⁶} × 1/(L_ρ κ(Ĉ)) and σ_w is chosen from {10⁰, 10⁻¹, 10⁻²} × 1/λ_max(Ĉ). We only report the results of each algorithm with the best-tuned step sizes; for SVRG we choose N = 2n.

In the first task, we consider a randomly generated MDP with 400 states and 10 actions (Dann et al., 2014). The transition probabilities are defined as P(s'|a, s) ∝ p^a_{ss'} + 10⁻⁵, where p^a_{ss'} ~ U[0, 1]. The data-generating policy and the start distribution were generated in a similar way. Each state is represented by a 201-dimensional feature vector, where 200 of the features were sampled from a uniform distribution and the last feature is a constant one. We chose γ = 0.95.

Figure 1 shows the performance of the various algorithms for n = 20000. First, notice that the stochastic variance reduction methods converge much faster than the others; in fact, our proposed methods achieve linear convergence. Second, as we increase ρ, the performance of PDBG, SVRG and SAGA improves significantly due to better conditioning, as predicted by our theoretical results.

Next, we test these algorithms on Mountain Car (Sutton & Barto, 1998, Chapter 8). To collect the dataset, we first ran Sarsa with d = 300 CMAC features to obtain a good policy, and then ran this policy to collect the trajectories that comprise the dataset. Figures 2 and 3 show that our proposed stochastic variance reduction methods dominate the other first-order methods. Moreover, with better conditioning (through a larger ρ), PDBG, SVRG and SAGA achieve faster convergence. Finally, as we increase the sample size n, SVRG and SAGA converge faster. This simulation verifies our theoretical finding in Table 1 that SVRG and SAGA need fewer epochs for large n.
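A hedged sketch (ours, not the authors' experimental code) of the Random-MDP data generation described above; the problem sizes (400 states, 10 actions, 201 features) and the 10⁻⁵ smoothing constant are our reconstruction of garbled numbers and should be treated as assumptions.

```python
import numpy as np

def random_mdp(num_states=400, num_actions=10, num_features=201, rng=None):
    """Generate transition kernel, data-generating policy and features for the Random MDP."""
    rng = np.random.default_rng(0) if rng is None else rng
    P = rng.uniform(size=(num_actions, num_states, num_states)) + 1e-5
    P /= P.sum(axis=2, keepdims=True)               # P(s'|a,s) proportional to p^a_{ss'} + 1e-5
    policy = rng.uniform(size=(num_states, num_actions))
    policy /= policy.sum(axis=1, keepdims=True)     # data-generating policy, built the same way
    Phi = rng.uniform(size=(num_states, num_features))
    Phi[:, -1] = 1.0                                # last feature is a constant one
    return P, policy, Phi
```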

[Figure 1. Random MDP with |S| = 400, d = 201, and n = 20000. Panels (a)-(c) plot the objective on a log scale for increasing values of ρ, from ρ = 0 up to a multiple of λ_max(Â^T Ĉ^{-1} Â).]

[Figure 2. Mountain Car dataset with d = 300 and n = 5000. Panels (a)-(c) as in Figure 1.]

[Figure 3. Mountain Car dataset with d = 300 and n = 20000. Panels (a)-(c) as in Figure 1.]

9 Conclusions

In this paper, we reformulated the EM-MSPBE minimization problem in policy evaluation as an empirical convex-concave saddle-point problem, and developed and analyzed a batch gradient method and two first-order stochastic variance reduction methods to solve it. An important result is that even when the reformulated saddle-point problem lacks strong convexity in the primal variables and has only strong concavity in the dual variables, the proposed algorithms still achieve a linear convergence rate. We are not aware of any similar results for primal-dual batch gradient methods or stochastic variance reduction methods. Furthermore, we showed that when both the feature dimension d and the number of samples n are large, the developed stochastic variance reduction methods are more efficient than any other gradient-based method that is convergent in off-policy settings.

This work leads to several interesting directions for future research. First, we believe it is important to extend the stochastic variance reduction methods to nonlinear approximation paradigms (Bhatnagar et al., 2009), especially with deep neural networks. Moreover, it remains an important open problem how to apply stochastic variance reduction techniques to policy optimization.

References

Balamurugan, P. and Bach, Francis. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems 29, 2016.

Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-dynamic programming: An overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1. IEEE, 1995.

Bhatnagar, Shalabh, Precup, Doina, Silver, David, Sutton, Richard S., Maei, Hamid R., and Szepesvári, Csaba. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 2009.

Boyan, Justin A. Technical update: Least-squares temporal difference learning. Machine Learning, 49, 2002.

Bradtke, Steven J. and Barto, Andrew G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.

Chambolle, Antonin and Pock, Thomas. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

Dai, Bo, He, Niao, Pan, Yunpeng, Boots, Byron, and Song, Le. Learning from conditional distributions via dual embeddings. arXiv preprint, 2016.

Dann, Christoph, Neumann, Gerhard, and Peters, Jan. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809-883, 2014.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Geramifard, Alborz, Bowling, Michael H., Zinkevich, Martin, and Sutton, Richard S. iLSTD: Eligibility traces and convergence analysis. In Advances in Neural Information Processing Systems 19, 2007.

Gohberg, Israel, Lancaster, Peter, and Rodman, Leiba. Indefinite Linear Algebra and Applications. Springer Science & Business Media, 2006.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

Kearns, Michael J. and Singh, Satinder P. Bias-variance error bounds for temporal difference updates. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT), 2000.

Korda, Nathaniel and Prashanth, L.A. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In Proceedings of the Thirty-Second International Conference on Machine Learning (ICML-15), 2015.

Lagoudakis, Michail G. and Parr, Ronald. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107-1149, 2003.

Lange, Sascha, Gabel, Thomas, and Riedmiller, Martin. Batch reinforcement learning. In Wiering, Marco and van Otterlo, Martijn (eds.), Reinforcement Learning: State of the Art. Springer Verlag, 2011.

Lazaric, Alessandro, Ghavamzadeh, Mohammad, and Munos, Rémi. Finite-sample analysis of LSTD. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, 2010.

Lian, Xiangru, Wang, Mengdi, and Liu, Ji. Finite-sum composition optimization via variance reduced gradient descent. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Liesen, Jörg and Parlett, Beresford N. On nonsymmetric saddle point matrices that allow conjugate gradient iterations. Numerische Mathematik, 108(4):605-624, 2008.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS) 28, 2015.
Lin, Long-Ji. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293-321, 1992.

Liu, Bo, Liu, Ji, Ghavamzadeh, Mohammad, Mahadevan, Sridhar, and Petrik, Marek. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, Amsterdam, Netherlands, 2015.

Nedić, A. and Bertsekas, Dimitri P. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):79-110, 2003.

Prashanth, L.A., Korda, Nathaniel, and Munos, Rémi. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014.

Precup, Doina, Sutton, Richard S., and Dasgupta, Sanjoy. Off-policy temporal-difference learning with function approximation. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.

Rockafellar, R. Tyrrell. Convex Analysis. Princeton University Press, 1970.

Shen, Shu-Qian, Huang, Ting-Zhu, and Cheng, Guang-Hui. A condition for the nonsymmetric saddle point matrix being diagonalizable and having real and positive eigenvalues. Journal of Computational and Applied Mathematics, 220:8-12, 2008.

Singh, Satinder P. and Sutton, Richard S. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123-158, 1996.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Sutton, Richard S., Maei, Hamid R., and Szepesvári, Csaba. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009a.

Sutton, Richard S., Maei, Hamid Reza, Precup, Doina, Bhatnagar, Shalabh, Silver, David, Szepesvári, Csaba, and Wiewiora, Eric. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009b.

Tsitsiklis, John N. and Van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 1997.

Valcarcel Macua, Sergio, Chen, Jianshu, Zazo, Santiago, and Sayed, Ali H. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260-1274, 2015.

Wang, Mengdi, Liu, Ji, and Fang, Ethan. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems (NIPS) 29, 2016.

Wasserman, Larry. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2013.


More information

arxiv: v1 [cs.lg] 30 Dec 2016

arxiv: v1 [cs.lg] 30 Dec 2016 Adaptive Least-Squares Temporal Difference Learning Timothy A. Mann and Hugo Penedones and Todd Hester Google DeepMind London, United Kingdom {kingtim, hugopen, toddhester}@google.com Shie Mannor Electrical

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Generalization and Function Approximation

Generalization and Function Approximation Generalization and Function Approximation 0 Generalization and Function Approximation Suggested reading: Chapter 8 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998.

More information

An Analysis of Laplacian Methods for Value Function Approximation in MDPs

An Analysis of Laplacian Methods for Value Function Approximation in MDPs An Analysis of Laplacian Methods for Value Function Approximation in MDPs Marek Petrik Department of Computer Science University of Massachusetts Amherst, MA 3 petrik@cs.umass.edu Abstract Recently, a

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

arxiv: v1 [math.oc] 5 Feb 2018

arxiv: v1 [math.oc] 5 Feb 2018 for Convex-Concave Saddle Point Problems without Strong Convexity Simon S. Du * 1 Wei Hu * arxiv:180.01504v1 math.oc] 5 Feb 018 Abstract We consider the convex-concave saddle point problem x max y fx +

More information

Low-rank Feature Selection for Reinforcement Learning

Low-rank Feature Selection for Reinforcement Learning Low-rank Feature Selection for Reinforcement Learning Bahram Behzadian and Marek Petrik Department of Computer Science University of New Hampshire {bahram, mpetrik}@cs.unh.edu Abstract Linear value function

More information

A UNIFIED FRAMEWORK FOR LINEAR FUNCTION APPROXIMATION OF VALUE FUNCTIONS IN STOCHASTIC CONTROL

A UNIFIED FRAMEWORK FOR LINEAR FUNCTION APPROXIMATION OF VALUE FUNCTIONS IN STOCHASTIC CONTROL A UNIFIED FRAMEWORK FOR LINEAR FUNCTION APPROXIMATION OF VALUE FUNCTIONS IN STOCHASTIC CONTROL Matilde Sánchez-Fernández Sergio Valcárcel, Santiago Zazo ABSTRACT This paper contributes with a unified formulation

More information

Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

More information

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning Active Policy Iteration: fficient xploration through Active Learning for Value Function Approximation in Reinforcement Learning Takayuki Akiyama, Hirotaka Hachiya, and Masashi Sugiyama Department of Computer

More information

CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum

CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum 1997 19 CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION 3.0. Introduction

More information

The Fixed Points of Off-Policy TD

The Fixed Points of Off-Policy TD The Fixed Points of Off-Policy TD J. Zico Kolter Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, MA 2139 kolter@csail.mit.edu Abstract Off-policy

More information

Basis Construction from Power Series Expansions of Value Functions

Basis Construction from Power Series Expansions of Value Functions Basis Construction from Power Series Expansions of Value Functions Sridhar Mahadevan Department of Computer Science University of Massachusetts Amherst, MA 3 mahadeva@cs.umass.edu Bo Liu Department of

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

y(x) = x w + ε(x), (1)

y(x) = x w + ε(x), (1) Linear regression We are ready to consider our first machine-learning problem: linear regression. Suppose that e are interested in the values of a function y(x): R d R, here x is a d-dimensional vector-valued

More information

Replacing eligibility trace for action-value learning with function approximation

Replacing eligibility trace for action-value learning with function approximation Replacing eligibility trace for action-value learning with function approximation Kary FRÄMLING Helsinki University of Technology PL 5500, FI-02015 TKK - Finland Abstract. The eligibility trace is one

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

Minimax strategy for prediction with expert advice under stochastic assumptions

Minimax strategy for prediction with expert advice under stochastic assumptions Minimax strategy for prediction ith expert advice under stochastic assumptions Wojciech Kotłosi Poznań University of Technology, Poland otlosi@cs.put.poznan.pl Abstract We consider the setting of prediction

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

Bayesian Active Learning With Basis Functions

Bayesian Active Learning With Basis Functions Bayesian Active Learning With Basis Functions Ilya O. Ryzhov Warren B. Powell Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA IEEE ADPRL April 13, 2011 1 / 29

More information

Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning

Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning Abstract Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a

More information

Model-Based Reinforcement Learning with Continuous States and Actions

Model-Based Reinforcement Learning with Continuous States and Actions Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions in Proceedings of the 16th European Symposium on Artificial Neural Networks

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Unifying Task Specification in Reinforcement Learning

Unifying Task Specification in Reinforcement Learning Martha White Abstract Reinforcement learning tasks are typically specified as Markov decision processes. This formalism has been highly successful, though specifications often couple the dynamics of the

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Convergence of Synchronous Reinforcement Learning. with linear function approximation

Convergence of Synchronous Reinforcement Learning. with linear function approximation Convergence of Synchronous Reinforcement Learning with Linear Function Approximation Artur Merke artur.merke@udo.edu Lehrstuhl Informatik, University of Dortmund, 44227 Dortmund, Germany Ralf Schoknecht

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

On the approximation of real powers of sparse, infinite, bounded and Hermitian matrices

On the approximation of real powers of sparse, infinite, bounded and Hermitian matrices On the approximation of real poers of sparse, infinite, bounded and Hermitian matrices Roman Werpachoski Center for Theoretical Physics, Al. Lotnikó 32/46 02-668 Warszaa, Poland Abstract We describe a

More information

Reinforcement Learning Algorithms in Markov Decision Processes AAAI-10 Tutorial. Part II: Learning to predict values

Reinforcement Learning Algorithms in Markov Decision Processes AAAI-10 Tutorial. Part II: Learning to predict values Ideas and Motivation Background Off-policy learning Option formalism Learning about one policy while behaving according to another Needed for RL w/exploration (as in Q-learning) Needed for learning abstract

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Subgradient Methods for Maximum Margin Structured Learning

Subgradient Methods for Maximum Margin Structured Learning Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

Multi-Attribute Bayesian Optimization under Utility Uncertainty

Multi-Attribute Bayesian Optimization under Utility Uncertainty Multi-Attribute Bayesian Optimization under Utility Uncertainty Raul Astudillo Cornell University Ithaca, NY 14853 ra598@cornell.edu Peter I. Frazier Cornell University Ithaca, NY 14853 pf98@cornell.edu

More information

Accelerating Stochastic Composition Optimization

Accelerating Stochastic Composition Optimization Journal of Machine Learning Research 18 (2017) 1-23 Submitted 10/16; Revised 6/17; Published 10/17 Accelerating Stochastic Composition Optimization Mengdi Wang Department of Operations Research and Financial

More information