Stochastic Variance Reduction Methods for Policy Evaluation

Simon S. Du 1   Jianshu Chen 2   Lihong Li 2   Lin Xiao 2   Dengyong Zhou 2

Abstract

Policy evaluation is concerned with estimating the value function that predicts the long-term values of states under a given policy. It is a crucial step in many reinforcement-learning algorithms. In this paper, we focus on policy evaluation with linear function approximation over a fixed dataset. We first transform the empirical policy evaluation problem into a (quadratic) convex-concave saddle-point problem, and then present a primal-dual batch gradient method, as well as two stochastic variance reduction methods, for solving the problem. These algorithms scale linearly in both sample size and feature dimension. Moreover, they achieve linear convergence even when the saddle-point problem has only strong concavity in the dual variables but no strong convexity in the primal variables. Numerical experiments on benchmark problems demonstrate the effectiveness of our methods.

1 Introduction

Reinforcement learning (RL) is a powerful learning paradigm for sequential decision making (see, e.g., Bertsekas & Tsitsiklis, 1995; Sutton & Barto, 1998). An RL agent interacts with the environment by repeatedly observing the current state, taking an action according to a certain policy, receiving a reward signal, and transitioning to a next state. A policy specifies which action to take given the current state. Policy evaluation estimates a value function that predicts the expected cumulative reward the agent would receive by following a fixed policy starting at a certain state. In addition to quantifying the long-term values of states, which can be of interest in its own right, the value function also provides important information for the agent to optimize its policy. For example, policy-iteration algorithms iterate between policy-evaluation steps and policy-improvement steps until a (near-)optimal policy is found (Bertsekas & Tsitsiklis, 1995; Lagoudakis & Parr, 2003). Therefore, estimating the value function efficiently and accurately is essential in RL.

There has been substantial work on policy evaluation, with temporal-difference (TD) methods being perhaps the most popular. These methods use the Bellman equation to bootstrap the estimation process. Different cost functions are formulated to exploit this idea, leading to different policy evaluation algorithms; see Dann et al. (2014) for a comprehensive survey. In this paper, we study policy evaluation by minimizing the mean squared projected Bellman error (MSPBE) with linear approximation of the value function. We focus on the batch setting where a fixed, finite dataset is given. This fixed-data setting is not only important in its own right (Lange et al., 2011), but is also an important component of other RL methods such as experience replay (Lin, 1992). The finite-data regime makes it possible to solve policy evaluation more efficiently with recently developed fast optimization methods based on stochastic variance reduction, such as SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014).

1 Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA. 2 Microsoft Research, Redmond, Washington 98052, USA. Correspondence to: Simon S. Du <ssdu@cs.cmu.edu>, Jianshu Chen <jianshuc@microsoft.com>, Lihong Li <lihongli@microsoft.com>, Lin Xiao <lin.xiao@microsoft.com>, Dengyong Zhou <denzho@microsoft.com>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
For minimizing strongly convex functions with a finite-sum structure, such methods enjoy the same low computational cost per iteration as the classical stochastic gradient method, but also achieve fast, linear convergence rates (i.e., exponential decay of the optimality gap in the objective). However, they cannot be applied directly to minimize the MSPBE, whose objective does not have the finite-sum structure. In this paper, we overcome this obstacle by transforming the empirical MSPBE problem into an equivalent convex-concave saddle-point problem that possesses the desired finite-sum structure. In the saddle-point problem, we consider the model parameters as the primal variables, which are coupled with the dual variables through a bilinear term. Moreover, without an ℓ2-regularization on the model parameters, the objective is only strongly concave in the dual variables, but not in the primal variables.

We propose a primal-dual batch gradient method, as well as two stochastic variance-reduction methods based on SVRG and SAGA, respectively. Surprisingly, we show that when the coupling matrix is full rank, these algorithms achieve linear convergence in both the primal and dual spaces, despite the lack of strong convexity of the objective in the primal variables.

Our results also extend to off-policy learning and to TD with eligibility traces (Sutton & Barto, 1998; Precup et al., 2001).

We note that Balamurugan & Bach (2016) have extended both SVRG and SAGA to solve convex-concave saddle-point problems with linear-convergence guarantees. The main differences between our results and theirs are as follows. First, linear convergence in Balamurugan & Bach (2016) relies on the assumption that the objective is strongly convex in the primal variables and strongly concave in the dual variables. Our results show, somewhat surprisingly, that only one of the two is necessary if the primal-dual coupling is bilinear and the coupling matrix is full rank. In fact, we are not aware of similar previous results even for the primal-dual batch gradient method, which we establish in this paper. Second, even if a strongly convex regularization on the primal variables is added to the MSPBE objective, the algorithms in Balamurugan & Bach (2016) cannot be applied efficiently. Their algorithms require that the proximal mappings of the strongly convex and concave regularization functions be computed efficiently. In our saddle-point formulation, the strong concavity in the dual variables comes from a quadratic function defined by the feature covariance matrix, which cannot be inverted efficiently and makes the proximal mapping costly to compute. Instead, our algorithms only use its (stochastic) gradients and hence are much more efficient.

We compare various gradient-based algorithms on Random MDP and Mountain Car datasets. The experiments demonstrate the effectiveness of our proposed methods.

2 Preliminaries

We consider a Markov Decision Process (MDP) (Puterman, 2005) described by (S, A, P^a_{ss'}, R, γ), where S is the set of states, A the set of actions, P^a_{ss'} the probability of transitioning from state s to state s' after taking action a, R(s, a) the reward received after taking action a in state s, and γ ∈ [0, 1) a discount factor. The goal of an agent is to find an action-selection policy π such that the long-term reward under this policy is maximized. For ease of exposition, we assume S is finite, but none of our results relies on this assumption.

A key step in many RL algorithms is to estimate the value function of a given policy π, defined as V^π(s) := E[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, π ]. Let V^π denote the vector constructed by stacking the values V^π(1), ..., V^π(|S|) on top of each other. Then V^π is the unique fixed point of the Bellman operator T^π:

    V^π = T^π V^π := R^π + γ P^π V^π,                                   (1)

where R^π is the expected reward vector under policy π, defined elementwise as R^π(s) = E_{π(a|s)}[R(s, a)], and P^π is the transition matrix induced by applying policy π, defined entrywise as P^π(s, s') = E_{π(a|s)}[P^a_{ss'}].

2.1 Mean squared projected Bellman error (MSPBE)

One approach to scaling up when the state space size |S| is large or infinite is to use a linear approximation for V^π. Formally, we use a feature map φ: S → R^d and approximate the value function by V̂_θ(s) = φ(s)^T θ, where θ ∈ R^d is the model parameter to be estimated. We want to find the θ that minimizes the mean squared projected Bellman error, or MSPBE:

    MSPBE(θ) := (1/2) ‖V̂_θ − Π T^π V̂_θ‖²_Ξ,                            (2)

where Ξ is a diagonal matrix whose diagonal elements form the stationary distribution over S induced by the policy π, and Π is the Ξ-weighted projection matrix onto the linear space spanned by φ(1), ..., φ(|S|), that is,

    Π = Φ (Φ^T Ξ Φ)^{-1} Φ^T Ξ,                                          (3)

where Φ := [φ^T(1); ...; φ^T(|S|)] is the matrix obtained by stacking the feature vectors row by row.
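To make the definition concrete, here is a minimal sketch (ours, not from the paper) of evaluating MSPBE(θ) via (2)–(3) for a small MDP whose reward vector R^π, transition matrix P^π, stationary distribution and features are all known explicitly; the function and variable names are our own.

```python
import numpy as np

def mspbe(theta, Phi, R_pi, P_pi, xi, gamma):
    """MSPBE(theta) = 0.5 * || V_theta - Pi T^pi V_theta ||^2_Xi, per (2)-(3)."""
    Xi = np.diag(xi)                                   # diagonal stationary-distribution matrix
    V = Phi @ theta                                    # linear value estimate V_theta
    TV = R_pi + gamma * P_pi @ V                       # Bellman operator applied to V_theta, per (1)
    Proj = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # Xi-weighted projection (3)
    diff = V - Proj @ TV
    return 0.5 * diff @ Xi @ diff
```

This direct evaluation requires the full model and is only feasible for small, tabular problems; the rest of the paper works with sample-based estimates instead.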
Substituting (3) and (1) into (2), we obtain (see, e.g., Dann et al., 2014)

    MSPBE(θ) = (1/2) ‖Φ^T Ξ (V̂_θ − T^π V̂_θ)‖²_{(Φ^T Ξ Φ)^{-1}}.

We can further rewrite this expression for the MSPBE as a standard weighted least-squares problem:

    MSPBE(θ) = (1/2) ‖Aθ − b‖²_{C^{-1}},

with properly defined A, b and C, described as follows. Suppose the MDP under policy π settles at its stationary distribution and generates an infinite transition sequence {(s_t, a_t, r_t, s_{t+1})}_{t=1}^∞, where s_t is the current state, a_t is the action, r_t is the reward, and s_{t+1} is the next state. Then, with the definitions φ_t := φ(s_t) and φ'_t := φ(s_{t+1}), we have

    A = E[φ_t (φ_t − γ φ'_t)^T],   b = E[r_t φ_t],   C = E[φ_t φ_t^T],   (4)

where the expectations E[·] are with respect to the stationary distribution. Many TD solutions converge to a minimizer of the MSPBE in the limit (Tsitsiklis & Van Roy, 1997; Dann et al., 2014).

2.2 Empirical MSPBE

In practice, the quantities in (4) are often unknown, and we only have access to a finite dataset with n transitions D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^n. By replacing the unknown statistics with their finite-sample estimates, we obtain the empirical MSPBE, or EM-MSPBE. Specifically, let

    Â := (1/n) Σ_{t=1}^n A_t,   b̂ := (1/n) Σ_{t=1}^n b_t,   Ĉ := (1/n) Σ_{t=1}^n C_t,   (5)

where, for t = 1, ..., n,

    A_t := φ_t (φ_t − γ φ'_t)^T,   b_t := r_t φ_t,   C_t := φ_t φ_t^T.   (6)
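A minimal sketch (ours, not the authors' code) of forming the finite-sample estimates in (5)–(6) from a batch of n transitions; it assumes `phi` and `phi_next` are (n, d) arrays of feature vectors and `r` is an (n,) array of rewards.

```python
import numpy as np

def empirical_statistics(phi, phi_next, r, gamma):
    """Return (A_hat, b_hat, C_hat) as defined in (5)-(6)."""
    n, d = phi.shape
    delta = phi - gamma * phi_next        # rows are (phi_t - gamma * phi'_t)^T
    A_hat = phi.T @ delta / n             # (1/n) sum_t phi_t (phi_t - gamma phi'_t)^T
    b_hat = phi.T @ r / n                 # (1/n) sum_t r_t phi_t
    C_hat = phi.T @ phi / n               # (1/n) sum_t phi_t phi_t^T
    return A_hat, b_hat, C_hat
```

Forming these d x d matrices explicitly costs O(nd²), which is exactly the pre-computation cost discussed below for LSTD and for one of the batch-gradient variants.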

The EM-MSPBE with an optional ℓ2-regularization is given by

    EM-MSPBE(θ) = (1/2) ‖Âθ − b̂‖²_{Ĉ^{-1}} + (ρ/2) ‖θ‖²,               (7)

where ρ ≥ 0 is a regularization factor. Observe that (7) is a (regularized) weighted least-squares problem. Assuming Ĉ is invertible, its optimal solution is

    θ* = (Â^T Ĉ^{-1} Â + ρ I)^{-1} Â^T Ĉ^{-1} b̂.                       (8)

Computing θ* directly requires O(nd²) operations to form the matrices Â, b̂ and Ĉ, and then O(d³) operations to complete the calculation. This method, known as least-squares temporal difference or LSTD (Bradtke & Barto, 1996; Boyan, 2002), can be very expensive when n and d are large. One can also skip forming the matrices explicitly and compute θ* using n recursive rank-one updates (Nedić & Bertsekas, 2003). Since each rank-one update costs O(d²), the total cost is still O(nd²).

In the sequel, we develop efficient algorithms to minimize the EM-MSPBE using stochastic variance reduction methods, which sample one (φ_t, φ'_t) per update without pre-computing Â, b̂ and Ĉ. These algorithms not only maintain a low O(d) per-iteration computation cost, but also attain fast linear convergence rates with a log(1/ε) dependence on the desired accuracy ε.

3 Saddle-Point Formulation of the EM-MSPBE

Our algorithms (in Section 5) are based on the stochastic variance reduction techniques developed for minimizing a finite sum of convex functions, more specifically, SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014). They deal with problems of the form

    min_{x ∈ R^d} f(x) := (1/n) Σ_{i=1}^n f_i(x),                        (9)

where each f_i is convex. We immediately notice that the EM-MSPBE in (7) cannot be put into this form, even though the matrices Â, b̂ and Ĉ have the finite-sum structure given in (5). Thus, extending variance reduction techniques to EM-MSPBE minimization is not straightforward. Nevertheless, we will show that minimizing the EM-MSPBE is equivalent to solving a convex-concave saddle-point problem which does possess the desired finite-sum structure.

To proceed, we resort to the machinery of conjugate functions (e.g., Rockafellar, 1970, Section 12). For a function f: R^d → R, its conjugate f*: R^d → R is defined as f*(y) := sup_x (y^T x − f(x)). Note that the conjugate of (1/2)‖x‖²_Ĉ is (1/2)‖y‖²_{Ĉ^{-1}}, i.e.,

    (1/2)‖y‖²_{Ĉ^{-1}} = max_x ( y^T x − (1/2)‖x‖²_Ĉ ).

With this relation, we can rewrite the EM-MSPBE in (7) as

    max_w ( w^T (b̂ − Âθ) − (1/2)‖w‖²_Ĉ ) + (ρ/2)‖θ‖²,

so that minimizing the EM-MSPBE is equivalent to solving

    min_{θ ∈ R^d} max_{w ∈ R^d} L(θ, w) = (1/n) Σ_{t=1}^n L_t(θ, w),     (10)

where the Lagrangian, defined as

    L(θ, w) := (ρ/2)‖θ‖² − w^T Â θ − (1/2)‖w‖²_Ĉ + w^T b̂,              (11)

may be decomposed using (5), with

    L_t(θ, w) := (ρ/2)‖θ‖² − w^T A_t θ − (1/2)‖w‖²_{C_t} + w^T b_t.

Therefore, minimizing the EM-MSPBE is equivalent to solving the saddle-point problem (10), which is convex in the primal variable θ and concave in the dual variable w. Moreover, it has a finite-sum structure similar to (9).
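To make the equivalence concrete, here is a small numerical sketch (ours, not the authors' code) that computes θ* from (8) and checks that the EM-MSPBE in (7) agrees with max_w L(θ, w); the inner concave problem in w is maximized in closed form at w(θ) = Ĉ^{-1}(b̂ − Âθ).

```python
import numpy as np

def em_mspbe(theta, A_hat, b_hat, C_hat, rho):
    """EM-MSPBE(theta) as in (7)."""
    resid = A_hat @ theta - b_hat
    return 0.5 * resid @ np.linalg.solve(C_hat, resid) + 0.5 * rho * theta @ theta

def lagrangian(theta, w, A_hat, b_hat, C_hat, rho):
    """L(theta, w) as in (11)."""
    return (0.5 * rho * theta @ theta - w @ (A_hat @ theta)
            - 0.5 * w @ (C_hat @ w) + w @ b_hat)

def theta_star(A_hat, b_hat, C_hat, rho):
    """Closed-form minimizer (8), i.e. the (regularized) LSTD solution."""
    d = A_hat.shape[1]
    CiA = np.linalg.solve(C_hat, A_hat)
    return np.linalg.solve(A_hat.T @ CiA + rho * np.eye(d),
                           A_hat.T @ np.linalg.solve(C_hat, b_hat))

def check_equivalence(theta, A_hat, b_hat, C_hat, rho):
    """Verify EM-MSPBE(theta) == max_w L(theta, w) at an arbitrary theta."""
    w_best = np.linalg.solve(C_hat, b_hat - A_hat @ theta)   # maximizer of the concave inner problem
    return np.isclose(em_mspbe(theta, A_hat, b_hat, C_hat, rho),
                      lagrangian(theta, w_best, A_hat, b_hat, C_hat, rho))
```

The check is simply the conjugate-function identity applied with x = b̂ − Âθ; it holds for any θ, not just the minimizer.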
Liu et al. (2015) and Valcarcel Macua et al. (2015) independently showed that the GTD2 algorithm (Sutton et al., 2009b) is in fact a stochastic gradient method for solving the saddle-point problem (10), although they obtained the saddle-point formulation via different derivations. More recently, Dai et al. (2016) used the conjugate-function approach to obtain saddle-point formulations for a more general class of problems and derived primal-dual stochastic gradient algorithms for solving them. However, these algorithms have sublinear convergence rates, which leaves much room for improvement when applied to problems with finite datasets. Recently, Lian et al. (2017) developed methods for general finite-sum composition optimization that achieve a linear convergence rate. Different from our methods, their stochastic gradients are biased, and they have a worse dependence on the condition numbers (κ³ and κ⁴).

The fast linear convergence of our algorithms presented in Sections 4 and 5 requires the following assumption:

Assumption 1. Â has full rank, Ĉ is strictly positive definite, and the feature vectors φ_t are uniformly bounded.

Under mild regularity conditions (e.g., Wasserman, 2013, Chapter 5), Â and Ĉ converge in probability to A and C defined in (4), respectively. Thus, if the true statistics A is non-singular and C is positive definite, and we have enough training samples, these assumptions are usually satisfied. They have been widely used in previous works on gradient-based algorithms (e.g., Sutton et al., 2009a;b).
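As a small diagnostic sketch (ours, not part of the paper's algorithms), Assumption 1 can be checked numerically on the empirical statistics:

```python
import numpy as np

def check_assumption(A_hat, C_hat, tol=1e-10):
    """Assumption 1: A_hat has full rank and C_hat is strictly positive definite."""
    full_rank = np.linalg.matrix_rank(A_hat) == A_hat.shape[1]
    min_eig_C = np.linalg.eigvalsh(C_hat).min()     # C_hat is symmetric
    return full_rank and min_eig_C > tol
```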

A direct consequence of Assumption 1 is that θ* in (8) is the unique minimizer of the EM-MSPBE in (7), even without any strongly convex regularization on θ (i.e., even if ρ = 0). However, if ρ = 0, then the Lagrangian L(θ, w) is only strongly concave in w, but not strongly convex in θ. In this case, we will show that non-singularity of the coupling matrix Â passes an implicit strong convexity on to θ, which our algorithms exploit to obtain linear convergence in both the primal and dual spaces.

4 A Primal-Dual Batch Gradient Method

Before diving into the stochastic variance reduction algorithms, we first present Algorithm 1, a primal-dual batch gradient (PDBG) algorithm for solving the saddle-point problem (10). In Step 2, the vector B(θ, w) is obtained by stacking the primal gradient and the negative dual gradient:

    B(θ, w) := [ ∇_θ L(θ, w) ; −∇_w L(θ, w) ] = [ ρθ − Â^T w ; Âθ + Ĉw − b̂ ].   (12)

Algorithm 1  PDBG for Policy Evaluation
Inputs: initial point (θ, w), step sizes σ_θ and σ_w, and number of epochs M.
1: for i = 1 to M do
2:   (θ, w) ← (θ, w) − diag(σ_θ I, σ_w I) B(θ, w), where B(θ, w) is computed according to (12).
3: end for

Some notation is needed in order to characterize the convergence rate of Algorithm 1. For any symmetric positive definite matrix S, let λ_max(S) and λ_min(S) denote its maximum and minimum eigenvalues, respectively, and define its condition number κ(S) := λ_max(S)/λ_min(S). We also define L_ρ and μ_ρ for any ρ ≥ 0:

    L_ρ := λ_max(ρI + Â^T Ĉ^{-1} Â),                                    (13)
    μ_ρ := λ_min(ρI + Â^T Ĉ^{-1} Â).                                    (14)

By Assumption 1, we have L_ρ ≥ μ_ρ > 0. The following theorem is proved in Appendix B.

Theorem 1. Suppose Assumption 1 holds and let (θ*, w*) be the (unique) solution of (10). If the step sizes are chosen as σ_θ = 1/(9 L_ρ κ(Ĉ)) and σ_w = 8/(9 λ_max(Ĉ)), then the number of iterations of Algorithm 1 needed to achieve ‖θ − θ*‖² + ‖w − w*‖² ≤ ε is upper bounded by

    O( κ(ρI + Â^T Ĉ^{-1} Â) · κ(Ĉ) · log(1/ε) ).                        (15)

We assigned specific values to the step sizes σ_θ and σ_w for clarity. In general, we can use similar step sizes while keeping their ratio roughly constant at σ_w/σ_θ ≈ 8 L_ρ/λ_min(Ĉ); see Appendices A and B for more details. In practice, one can use a parameter search on a small subset of data to find reasonable step sizes. How to automatically select and adjust step sizes remains an interesting open problem.

Note that the linear rate is determined by two parts: (i) the strongly convex regularization parameter ρ, and (ii) the positive definiteness of Â^T Ĉ^{-1} Â. The second part can be interpreted as transferring the strong concavity in the dual variables through the full-rank bilinear coupling matrix Â. For this reason, even if the saddle-point problem (10) has only strong concavity in the dual variables (when ρ = 0), the algorithm still enjoys a linear convergence rate.

Moreover, even if ρ > 0, it would be inefficient to solve problem (10) using primal-dual algorithms based on proximal mappings of the strongly convex and concave terms (e.g., Chambolle & Pock, 2011; Balamurugan & Bach, 2016). The reason is that, in (10), the strong concavity of the Lagrangian with respect to the dual variable lies in the quadratic function (1/2)‖w‖²_Ĉ, whose proximal mapping cannot be computed efficiently. In contrast, the PDBG algorithm only needs its gradients.

If we pre-compute and store Â, b̂ and Ĉ, which costs O(nd²) operations, then computing the gradient operator B(θ, w) in (12) during each iteration of PDBG costs O(d²) operations. Alternatively, if we do not want to store these d × d matrices (especially if d is large), we can compute B(θ, w) as a finite sum on the fly. More specifically, B(θ, w) = (1/n) Σ_{t=1}^n B_t(θ, w), where, for each t = 1, ..., n,

    B_t(θ, w) = [ ρθ − A_t^T w ; A_t θ + C_t w − b_t ].                  (16)
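As a concrete illustration (ours, following the reconstruction of (12) and (16) above, not the authors' code), the per-sample gradient B_t and the on-the-fly variant of Algorithm 1 that averages B_t over the dataset can be sketched as follows; as explained next, the rank-one structure of A_t, b_t, C_t is what makes each B_t evaluation cost only O(d).

```python
import numpy as np

def B_t(theta, w, phi_t, phi_next_t, r_t, gamma, rho):
    """Per-sample gradient B_t(theta, w) from (16), computed in O(d)."""
    delta_t = phi_t - gamma * phi_next_t        # phi_t - gamma * phi'_t
    a = phi_t @ w                               # scalar phi_t^T w
    c = delta_t @ theta                         # scalar (phi_t - gamma phi'_t)^T theta
    g_theta = rho * theta - delta_t * a         # rho*theta - A_t^T w
    g_w_neg = phi_t * (c + a - r_t)             # A_t theta + C_t w - b_t  (negative dual gradient)
    return g_theta, g_w_neg

def pdbg(phi, phi_next, r, gamma, rho, sigma_theta, sigma_w, num_iters):
    """Algorithm 1 (PDBG), computing B(theta, w) as a finite sum on the fly (O(nd) per iteration)."""
    n, d = phi.shape
    theta, w = np.zeros(d), np.zeros(d)
    for _ in range(num_iters):
        g_theta, g_w_neg = np.zeros(d), np.zeros(d)
        for t in range(n):
            gt, gw = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
            g_theta += gt / n
            g_w_neg += gw / n
        theta = theta - sigma_theta * g_theta   # primal descent
        w = w - sigma_w * g_w_neg               # dual ascent (step along -(-grad_w L))
    return theta, w
```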
Since A_t, b_t and C_t are all rank-one matrices, as given in (6), computing each B_t(θ, w) requires only O(d) operations. Therefore, computing B(θ, w) costs O(nd) operations, as it averages B_t(θ, w) over the n samples.

5 Stochastic Variance Reduction Methods

If we replace B(θ, w) in line 2 of Algorithm 1 by the stochastic gradient B_t(θ, w) in (16), we recover the GTD2 algorithm of Sutton et al. (2009b), applied to a fixed dataset, possibly with multiple passes. It has a low per-iteration cost but a slow, sublinear convergence rate. In this section, we present two stochastic variance reduction methods and show that they achieve fast linear convergence.

5.1 SVRG for policy evaluation

Algorithm 2 is adapted from the stochastic variance reduced gradient (SVRG) method (Johnson & Zhang, 2013). It uses two layers of loops and maintains two sets of parameters, (θ̃, w̃) and (θ, w). In the outer loop, the algorithm computes a full gradient B(θ̃, w̃) at (θ̃, w̃), which takes O(nd) operations.

Algorithm 2  SVRG for Policy Evaluation
Inputs: initial point (θ, w), step sizes (σ_θ, σ_w), number of outer iterations M, and number of inner iterations N.
1: for m = 1 to M do
2:   Initialize (θ̃, w̃) = (θ, w) and compute B(θ̃, w̃).
3:   for j = 1 to N do
4:     Sample an index t_j from {1, ..., n}.
5:     Compute B_{t_j}(θ, w) and B_{t_j}(θ̃, w̃).
6:     (θ, w) ← (θ, w) − diag(σ_θ I, σ_w I) B̃_{t_j}(θ, w, θ̃, w̃), where B̃_{t_j}(θ, w, θ̃, w̃) is given in (17).
7:   end for
8: end for

Algorithm 3  SAGA for Policy Evaluation
Inputs: initial point (θ, w), step sizes σ_θ and σ_w, and number of iterations M.
1: Compute g_t = B_t(θ, w) for t = 1, ..., n.
2: Compute B = B(θ, w) = (1/n) Σ_{t=1}^n g_t.
3: for m = 1 to M do
4:   Sample an index t_m from {1, ..., n}.
5:   Compute h_{t_m} = B_{t_m}(θ, w).
6:   (θ, w) ← (θ, w) − diag(σ_θ I, σ_w I) (B + h_{t_m} − g_{t_m}).
7:   B ← B + (1/n)(h_{t_m} − g_{t_m}).
8:   g_{t_m} ← h_{t_m}.
9: end for

Afterwards, the algorithm executes the inner loop, which randomly samples an index t_j and updates (θ, w) using the variance-reduced stochastic gradient

    B̃_{t_j}(θ, w, θ̃, w̃) = B_{t_j}(θ, w) + B(θ̃, w̃) − B_{t_j}(θ̃, w̃).   (17)

Here, B_{t_j}(θ, w) is the stochastic gradient at (θ, w) computed from the random sample with index t_j, and B(θ̃, w̃) − B_{t_j}(θ̃, w̃) is a correction term that reduces the variance of B_{t_j}(θ, w) while keeping B̃_{t_j}(θ, w, θ̃, w̃) an unbiased estimate of B(θ, w). Since B(θ̃, w̃) is computed once during each iteration of the outer loop with cost O(nd) (as explained at the end of Section 4), and each of the N iterations of the inner loop costs O(d) operations, the total computational cost of SVRG for each outer loop is O(nd + Nd). We present the overall complexity analysis of Algorithm 2 in Section 5.3.
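A minimal sketch of Algorithm 2 (ours, not the authors' implementation), reusing the `B_t` helper sketched in Section 4; the snapshot full gradient is recomputed once per outer iteration and the inner updates use the variance-reduced direction (17).

```python
import numpy as np

def svrg_pe(phi, phi_next, r, gamma, rho, sigma_theta, sigma_w, M, N, rng=None):
    """SVRG for policy evaluation (Algorithm 2), using B_t from the Section 4 sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = phi.shape
    theta, w = np.zeros(d), np.zeros(d)
    for _ in range(M):                                  # outer loop
        theta_s, w_s = theta.copy(), w.copy()           # snapshot (theta~, w~)
        full_t, full_w = np.zeros(d), np.zeros(d)
        for t in range(n):                              # full gradient at the snapshot, O(nd)
            gt, gw = B_t(theta_s, w_s, phi[t], phi_next[t], r[t], gamma, rho)
            full_t += gt / n
            full_w += gw / n
        for _ in range(N):                              # inner loop, O(d) per step
            t = rng.integers(n)
            gt, gw = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
            st, sw = B_t(theta_s, w_s, phi[t], phi_next[t], r[t], gamma, rho)
            theta -= sigma_theta * (gt - st + full_t)   # variance-reduced primal step, per (17)
            w -= sigma_w * (gw - sw + full_w)           # variance-reduced dual step
    return theta, w
```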
5.2 SAGA for policy evaluation

The second stochastic variance reduction method for policy evaluation is adapted from SAGA (Defazio et al., 2014); see Algorithm 3. It uses a single loop and maintains a single set of parameters (θ, w). Algorithm 3 starts by computing each component gradient g_t = B_t(θ, w) at the initial point, and also forms their average B = (1/n) Σ_t g_t. At each iteration, the algorithm randomly picks an index t_m ∈ {1, ..., n} and computes the stochastic gradient h_{t_m} = B_{t_m}(θ, w). Then it updates (θ, w) using the variance-reduced stochastic gradient B + h_{t_m} − g_{t_m}, where g_{t_m} is the previously computed stochastic gradient for the t_m-th sample (associated with certain past values of θ and w). Afterwards, it updates the batch gradient estimate B to B + (1/n)(h_{t_m} − g_{t_m}) and replaces g_{t_m} with h_{t_m}.

As Algorithm 3 proceeds, different vectors g_t are computed using different values of θ and w (depending on when the index t was last sampled). So in general we need to store all vectors g_t, for t = 1, ..., n, to facilitate individual updates, which would cost an additional O(nd) storage. However, by exploiting the rank-one structure in (6), we only need to store three scalars per sample, namely (φ_t − γφ'_t)^T θ, (φ_t − γφ'_t)^T w, and φ_t^T w, and form g_{t_m} on the fly with O(d) computation. Overall, each iteration of SAGA costs O(d) operations.
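Before turning to the analysis, here is a matching sketch of Algorithm 3 (again ours, reusing the `B_t` helper); for clarity it stores the full table of per-sample gradients rather than the three scalars per sample mentioned above.

```python
import numpy as np

def saga_pe(phi, phi_next, r, gamma, rho, sigma_theta, sigma_w, M, rng=None):
    """SAGA for policy evaluation (Algorithm 3), using B_t from the Section 4 sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = phi.shape
    theta, w = np.zeros(d), np.zeros(d)
    g = np.zeros((n, 2, d))                              # stored per-sample gradients g_t
    for t in range(n):
        g[t, 0], g[t, 1] = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
    B_avg = g.mean(axis=0)                               # running average B
    for _ in range(M):
        t = rng.integers(n)
        ht, hw = B_t(theta, w, phi[t], phi_next[t], r[t], gamma, rho)
        theta -= sigma_theta * (B_avg[0] + ht - g[t, 0])  # variance-reduced primal step
        w -= sigma_w * (B_avg[1] + hw - g[t, 1])          # variance-reduced dual step
        B_avg[0] += (ht - g[t, 0]) / n                    # update the running average (step 7)
        B_avg[1] += (hw - g[t, 1]) / n
        g[t, 0], g[t, 1] = ht, hw                         # replace the stored gradient (step 8)
    return theta, w
```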

5.3 Theoretical analyses of SVRG and SAGA

To study the convergence properties of SVRG and SAGA for policy evaluation, we introduce a smoothness parameter L_G based on the stochastic gradients B_t(θ, w). Let β := σ_w/σ_θ be the ratio of the dual to the primal step size, and define a pair of weighted Euclidean norms

    Ω(θ, w) := (‖θ‖² + β^{-1}‖w‖²)^{1/2},   Ω'(θ, w) := (‖θ‖² + β‖w‖²)^{1/2}.

Note that Ω(θ − θ*, w − w*) upper bounds the error in optimizing θ: Ω(θ − θ*, w − w*) ≥ ‖θ − θ*‖. Therefore, any bound on Ω(θ − θ*, w − w*) applies automatically to ‖θ − θ*‖. Next, we define the parameter L_G through its square:

    L_G² := (1/n) Σ_{t=1}^n  sup_{(θ_1, w_1) ≠ (θ_2, w_2)}  Ω'(B_t(θ_1, w_1) − B_t(θ_2, w_2))² / Ω(θ_1 − θ_2, w_1 − w_2)².

This definition is similar to the smoothness constant used in Balamurugan & Bach (2016), except that we use the step-size ratio β, rather than the strong convexity and concavity parameters of the Lagrangian, to define Ω and Ω'. (Since our saddle-point problem is not necessarily strongly convex in θ when ρ = 0, we could not define them in the same way as Balamurugan & Bach, 2016.) Substituting the definition of B_t(θ, w) in (16), we have

    L_G² = ‖ (1/n) Σ_{t=1}^n G_t^T G_t ‖,   where   G_t := [ ρI   −√β A_t^T ;  √β A_t   β C_t ].   (18)

With the above definitions, we characterize the convergence of Ω(θ_m − θ*, w_m − w*), where (θ*, w*) is the solution of (10) and (θ_m, w_m) is the output of the algorithms after the m-th iteration; for SVRG, this is the m-th outer iteration of Algorithm 2. The following two theorems are proved in Appendices C and D, respectively.

Theorem 2 (Convergence rate of SVRG). Suppose Assumption 1 holds. If we choose

    σ_θ = μ_ρ / (48 κ(Ĉ) L_G²),   σ_w = (8 L_ρ / λ_min(Ĉ)) σ_θ,   N = 512 κ(Ĉ) L_G² / μ_ρ²,

where L_ρ and μ_ρ are defined in (13) and (14), then

    E[ Ω(θ_m − θ*, w_m − w*)² ] ≤ (4/5)^m Ω(θ_0 − θ*, w_0 − w*)².

The overall computational cost for reaching E[Ω(θ_m − θ*, w_m − w*)] ≤ ε is upper bounded by

    O( (n + κ(Ĉ) L_G² / λ²_min(ρI + Â^T Ĉ^{-1} Â)) d log(1/ε) ).          (19)

Theorem 3 (Convergence rate of SAGA). Suppose Assumption 1 holds. If we choose σ_θ = μ_ρ / (3(8 κ²(Ĉ) L_G² + n μ_ρ²)) and σ_w = (8 L_ρ / λ_min(Ĉ)) σ_θ in Algorithm 3, then

    E[ Ω(θ_m − θ*, w_m − w*)² ] ≤ 2 (1 − τ)^m Ω(θ_0 − θ*, w_0 − w*)²,   where   τ = μ_ρ² / (9(8 κ²(Ĉ) L_G² + n μ_ρ²)).

The total cost to achieve E[Ω(θ_m − θ*, w_m − w*)] ≤ ε has the same bound as in (19).

Similar to our result in (15), both the SVRG and SAGA algorithms for policy evaluation enjoy linear convergence even if there is no strong convexity in the saddle-point problem (10) (i.e., when ρ = 0). This is mainly due to the positive definiteness of Â^T Ĉ^{-1} Â when Ĉ is positive definite and Â is full rank. In contrast, the linear convergence of SVRG and SAGA in Balamurugan & Bach (2016) requires the Lagrangian to be both strongly convex in θ and strongly concave in w. Moreover, in the policy evaluation problem, the strong concavity with respect to the dual variable comes from the weighted quadratic norm (1/2)‖w‖²_Ĉ, which does not admit an efficient proximal mapping as required by the proximal versions of SVRG and SAGA in Balamurugan & Bach (2016). Our algorithms only require stochastic gradients of this function, which are easy to compute thanks to its finite-sum structure.

Balamurugan & Bach (2016) also proposed accelerated variants of SVRG and SAGA using the catalyst framework of Lin et al. (2015). Similar extensions can be made for the three algorithms presented in this paper; we omit the details due to space limits.

6 Comparison of Different Algorithms

This section compares the computational complexities of several representative policy-evaluation algorithms that minimize the EM-MSPBE, as summarized in Table 1.

Table 1. Complexity of different policy evaluation algorithms. In the table, d is the feature dimension, n is the dataset size, κ := κ(ρI + Â^T Ĉ^{-1} Â), κ_G := L_G / λ_min(ρI + Â^T Ĉ^{-1} Â), and κ_GTD2 is a condition number associated with GTD2.
    SVRG / SAGA:   O( (n + κ(Ĉ) κ_G²) d log(1/ε) )
    GTD2:          O( d κ_GTD2 / ε )
    PDBG-(I):      O( n d κ κ(Ĉ) log(1/ε) )
    PDBG-(II):     O( n d² + d² κ κ(Ĉ) log(1/ε) )
    LSTD:          O( n d² )  or  O( n d² + d³ )

The upper part of the table lists algorithms whose complexity is linear in the feature dimension d, including the two new algorithms presented in the previous section. We can also apply GTD2 to a finite dataset, with samples drawn uniformly at random with replacement. It costs O(d) per iteration, but has a sublinear convergence rate in ε. In practice, one may choose ε = Θ(1/n) for generalization reasons (see, e.g., Lazaric et al., 2010), leading to an O(κ_GTD2 · nd) overall complexity for GTD2, where κ_GTD2 is a condition number associated with the algorithm. However, as verified by our experiments, the bounds in the table show that our SVRG/SAGA-based algorithms are much faster, as their effective condition numbers vanish when n becomes large. TDC has a complexity similar to GTD2's.

In the table, we list two different implementations of PDBG. PDBG-(I) computes the gradient by averaging the stochastic gradients over the entire dataset at each iteration, which costs O(nd) operations; see the discussion at the end of Section 4. PDBG-(II) first pre-computes the matrices Â, b̂ and Ĉ using O(nd²) operations, and then computes the batch gradient at each iteration with O(d²) operations.
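The quantities κ(Ĉ), κ, L_ρ and μ_ρ that drive these bounds can be computed directly from the empirical statistics; a minimal sketch (ours, computing κ_G would additionally require L_G from (18) and is omitted):

```python
import numpy as np

def problem_constants(A_hat, C_hat, rho=0.0):
    """Return (L_rho, mu_rho, kappa, kappa_C) per (13)-(14) and the Table 1 caption."""
    d = A_hat.shape[1]
    M = rho * np.eye(d) + A_hat.T @ np.linalg.solve(C_hat, A_hat)   # rho*I + A^T C^{-1} A
    eig_M = np.linalg.eigvalsh((M + M.T) / 2)      # symmetrize against round-off
    eig_C = np.linalg.eigvalsh(C_hat)
    L_rho, mu_rho = eig_M.max(), eig_M.min()       # (13) and (14)
    kappa = L_rho / mu_rho                         # kappa(rho*I + A^T C^{-1} A)
    kappa_C = eig_C.max() / eig_C.min()            # kappa(C_hat)
    return L_rho, mu_rho, kappa, kappa_C
```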
If d is very large (e.g., when d > n), PDBG-(I) has an advantage over PDBG-(II). The lower part of the table includes LSTD, which has O(nd²) complexity if rank-one updates are used. SVRG and SAGA are more efficient than the other algorithms when either d or n is very large. In particular, they have a lower complexity than LSTD when

    d > (1 + κ(Ĉ) κ_G²/n) log(1/ε),

a condition that is easy to satisfy when n is very large. On the other hand, the SVRG and SAGA algorithms are more efficient than PDBG-(I) if n is large, say n > κ(Ĉ) κ_G² / (κ κ(Ĉ)), where κ and κ_G are defined in the caption of Table 1.

There are other algorithms whose complexity scales linearly with n and d, including iLSTD (Geramifard et al., 2007), GTD2 and TDC (Sutton et al., 2009b), fLSTD-SA (Prashanth et al., 2014), and the more recent algorithms of Wang et al. (2016) and Dai et al. (2016). However, their convergence is slow: the number of iterations required to reach a desired accuracy ε grows as 1/ε or worse.

The CTD algorithm (Korda & Prashanth, 2015) uses a similar idea as SVRG to reduce the variance in TD updates. That algorithm is shown to have a similar linear convergence rate, but in an online setting where the data stream is generated by a Markov process with finite states and exponential mixing. The method solves for a fixed-point solution by stochastic approximation; as a result, it can be non-convergent in off-policy learning, while our algorithms remain stable (cf. Section 7.1).

7 Extensions

It is possible to extend our approach to accelerate the optimization of other objectives such as MSBE and NEU (Dann et al., 2014). In this section, we briefly describe two extensions of the algorithms developed earlier.

7.1 Off-policy learning

In some cases, we may want to estimate the value function of a policy π from a set of data D generated by a different behavior policy π_b. This is called off-policy learning (Sutton & Barto, 1998, Chapter 8). In the off-policy case, samples are generated from the distribution induced by the behavior policy π_b, not the target policy π. While such a mismatch often causes stochastic-approximation-based methods to diverge (Tsitsiklis & Van Roy, 1997), our gradient-based algorithms remain convergent with the same (fast) convergence rate.

Consider the RL framework outlined in Section 2. For each state-action pair (s_t, a_t) such that π_b(a_t|s_t) > 0, we define the importance ratio ρ_t := π(a_t|s_t)/π_b(a_t|s_t). The EM-MSPBE for off-policy learning has the same expression as in (7), except that A_t, b_t and C_t are modified by the weight factor ρ_t, as listed in Table 2; see also Liu et al. (2015, Eqn. 6) for a related discussion. Algorithms 1-3 remain the same for the off-policy case after A_t, b_t and C_t are modified correspondingly.

7.2 Learning with eligibility traces

Eligibility traces are a useful technique for trading off bias and variance in TD learning (Singh & Sutton, 1996; Kearns & Singh, 2000). When they are used, we can pre-compute z_t in Table 2 before running our new algorithms. Note that the EM-MSPBE with eligibility traces has the same form as (7), with A_t, b_t and C_t defined according to the last row of Table 2. At the m-th step of the learning process, the algorithm randomly samples z_{t_m}, φ_{t_m}, φ'_{t_m} and r_{t_m} from the fixed dataset and computes the corresponding stochastic gradients, where the index t_m is uniformly distributed over {1, ..., n} and independent across different values of m. Algorithms 1-3 immediately work for this case, enjoying a similar linear convergence rate and a computation complexity linear in n and d. We need an additional O(nd) operations to pre-compute the z_t recursively and an additional O(nd) storage for them; however, this does not change the order of the total complexity for SVRG/SAGA.

Table 2. Expressions of A_t, b_t and C_t for different cases of policy evaluation. Here ρ_t := π(a_t|s_t)/π_b(a_t|s_t), and z_t := Σ_{i=1}^t (γλ)^{t-i} φ_i, where λ is a given parameter.
    On-policy:          A_t = φ_t (φ_t − γφ'_t)^T,      b_t = r_t φ_t,       C_t = φ_t φ_t^T
    Off-policy:         A_t = ρ_t φ_t (φ_t − γφ'_t)^T,  b_t = ρ_t r_t φ_t,   C_t = φ_t φ_t^T
    Eligibility trace:  A_t = z_t (φ_t − γφ'_t)^T,      b_t = r_t z_t,       C_t = φ_t φ_t^T
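A minimal sketch (ours, not from the paper) of the per-sample quantities in Table 2 for these two extensions; the symbol names `rho_t` and `lam` mirror the notation reconstructed above and should be treated as assumptions where the original symbols were garbled.

```python
import numpy as np

def off_policy_terms(phi_t, phi_next_t, r_t, rho_t, gamma):
    """Off-policy row of Table 2: A_t, b_t are reweighted by the importance ratio rho_t."""
    delta_t = phi_t - gamma * phi_next_t
    A_t = rho_t * np.outer(phi_t, delta_t)      # rho_t * phi_t (phi_t - gamma phi'_t)^T
    b_t = rho_t * r_t * phi_t
    C_t = np.outer(phi_t, phi_t)                # unchanged by the importance weight
    return A_t, b_t, C_t

def eligibility_traces(phi, gamma, lam):
    """Pre-compute z_t = sum_{i<=t} (gamma*lam)^(t-i) phi_i along one trajectory, in O(nd)."""
    z = np.zeros_like(phi, dtype=float)
    z[0] = phi[0]
    for t in range(1, len(phi)):
        z[t] = gamma * lam * z[t - 1] + phi[t]  # recursive trace update
    return z
```

For the eligibility-trace row of Table 2, the pre-computed z_t simply replaces φ_t in A_t and b_t while C_t keeps the plain features, so the same O(d) per-sample gradient computation carries over.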
8 Experiments

In this section, we compare the following algorithms on two benchmark problems: (i) PDBG (Algorithm 1); (ii) GTD2, with samples drawn randomly with replacement from the dataset; (iii) TD: the fLSTD-SA algorithm of Prashanth et al. (2014); (iv) SVRG (Algorithm 2); and (v) SAGA (Algorithm 3). Note that when ρ > 0, the TD solution and the EM-MSPBE minimizer differ, so we do not include TD in that case.

For step size tuning, σ_θ is chosen from {10⁰, 10⁻¹, ..., 10⁻⁶} × 1/(L_ρ κ(Ĉ)) and σ_w is chosen from {10⁰, 10⁻¹, 10⁻²} × 1/λ_max(Ĉ). We only report the results of each algorithm with the best-tuned step sizes; for SVRG we choose N = 2n.

In the first task, we consider a randomly generated MDP with 400 states and 10 actions (Dann et al., 2014). The transition probabilities are defined as P(s'|a, s) ∝ p^a_{ss'} + 10⁻⁵, where p^a_{ss'} ~ U[0, 1]. The data-generating policy and the start distribution were generated in a similar way. Each state is represented by a 201-dimensional feature vector, where 200 of the features were sampled from a uniform distribution and the last feature is a constant one. We chose γ = 0.95.

Figure 1 shows the performance of the various algorithms for n = 20000. First, notice that the stochastic variance reduction methods converge much faster than the others; in fact, our proposed methods achieve linear convergence. Second, as we increase ρ, the performance of PDBG, SVRG and SAGA improves significantly due to better conditioning, as predicted by our theoretical results.

Next, we test these algorithms on Mountain Car (Sutton & Barto, 1998, Chapter 8). To collect the dataset, we first ran Sarsa with d = 300 CMAC features to obtain a good policy, and then ran this policy to collect the trajectories that comprise the dataset. Figures 2 and 3 show that our proposed stochastic variance reduction methods dominate the other first-order methods. Moreover, with better conditioning (through a larger ρ), PDBG, SVRG and SAGA achieve faster convergence. Finally, as we increase the sample size n, SVRG and SAGA converge faster. This simulation verifies our theoretical finding in Table 1 that SVRG and SAGA need fewer epochs for large n.
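A hedged sketch (ours, not the authors' experimental code) of the Random-MDP data generation described above; the problem sizes (400 states, 10 actions, 201 features) and the 10⁻⁵ smoothing constant are our reconstruction of garbled numbers and should be treated as assumptions.

```python
import numpy as np

def random_mdp(num_states=400, num_actions=10, num_features=201, rng=None):
    """Generate transition kernel, data-generating policy and features for the Random MDP."""
    rng = np.random.default_rng(0) if rng is None else rng
    P = rng.uniform(size=(num_actions, num_states, num_states)) + 1e-5
    P /= P.sum(axis=2, keepdims=True)               # P(s'|a,s) proportional to p^a_{ss'} + 1e-5
    policy = rng.uniform(size=(num_states, num_actions))
    policy /= policy.sum(axis=1, keepdims=True)     # data-generating policy, built the same way
    Phi = rng.uniform(size=(num_states, num_features))
    Phi[:, -1] = 1.0                                # last feature is a constant one
    return P, policy, Phi
```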

[Figure 1. Random MDP with |S| = 400, d = 201, and n = 20000. Panels (a)-(c) plot the objective on a log scale for increasing values of ρ, from ρ = 0 up to a multiple of λ_max(Â^T Ĉ^{-1} Â).]

[Figure 2. Mountain Car dataset with d = 300 and n = 5000. Panels (a)-(c) as in Figure 1.]

[Figure 3. Mountain Car dataset with d = 300 and n = 20000. Panels (a)-(c) as in Figure 1.]

9 Conclusions

In this paper, we reformulated the EM-MSPBE minimization problem in policy evaluation as an empirical convex-concave saddle-point problem, and developed and analyzed a batch gradient method and two first-order stochastic variance reduction methods to solve it. An important result is that even when the reformulated saddle-point problem lacks strong convexity in the primal variables and has only strong concavity in the dual variables, the proposed algorithms still achieve a linear convergence rate. We are not aware of any similar results for primal-dual batch gradient methods or stochastic variance reduction methods. Furthermore, we showed that when both the feature dimension d and the number of samples n are large, the developed stochastic variance reduction methods are more efficient than any other gradient-based method that is convergent in off-policy settings.

This work leads to several interesting directions for future research. First, we believe it is important to extend the stochastic variance reduction methods to nonlinear approximation paradigms (Bhatnagar et al., 2009), especially with deep neural networks. Moreover, it remains an important open problem how to apply stochastic variance reduction techniques to policy optimization.

References

Balamurugan, P. and Bach, Francis. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems 29, 2016.

Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-dynamic programming: An overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1. IEEE, 1995.

Bhatnagar, Shalabh, Precup, Doina, Silver, David, Sutton, Richard S., Maei, Hamid R., and Szepesvári, Csaba. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 2009.

Boyan, Justin A. Technical update: Least-squares temporal difference learning. Machine Learning, 49, 2002.

Bradtke, Steven J. and Barto, Andrew G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.

Chambolle, Antonin and Pock, Thomas. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

Dai, Bo, He, Niao, Pan, Yunpeng, Boots, Byron, and Song, Le. Learning from conditional distributions via dual embeddings. arXiv preprint, 2016.

Dann, Christoph, Neumann, Gerhard, and Peters, Jan. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809-883, 2014.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Geramifard, Alborz, Bowling, Michael H., Zinkevich, Martin, and Sutton, Richard S. iLSTD: Eligibility traces and convergence analysis. In Advances in Neural Information Processing Systems 19, 2007.

Gohberg, Israel, Lancaster, Peter, and Rodman, Leiba. Indefinite Linear Algebra and Applications. Springer Science & Business Media, 2006.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

Kearns, Michael J. and Singh, Satinder P. Bias-variance error bounds for temporal difference updates. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT), 2000.

Korda, Nathaniel and Prashanth, L.A. On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In Proceedings of the Thirty-Second International Conference on Machine Learning (ICML-15), 2015.

Lagoudakis, Michail G. and Parr, Ronald. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107-1149, 2003.

Lange, Sascha, Gabel, Thomas, and Riedmiller, Martin. Batch reinforcement learning. In Wiering, Marco and van Otterlo, Martijn (eds.), Reinforcement Learning: State of the Art. Springer Verlag, 2011.

Lazaric, Alessandro, Ghavamzadeh, Mohammad, and Munos, Rémi. Finite-sample analysis of LSTD. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, 2010.

Lian, Xiangru, Wang, Mengdi, and Liu, Ji. Finite-sum composition optimization via variance reduced gradient descent. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Liesen, Jörg and Parlett, Beresford N. On nonsymmetric saddle point matrices that allow conjugate gradient iterations. Numerische Mathematik, 108(4):605-624, 2008.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS) 28, 2015.
Lin, Long-Ji. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293-321, 1992.

Liu, Bo, Liu, Ji, Ghavamzadeh, Mohammad, Mahadevan, Sridhar, and Petrik, Marek. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, Amsterdam, Netherlands, 2015.

Nedić, A. and Bertsekas, Dimitri P. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):79-110, 2003.

Prashanth, L.A., Korda, Nathaniel, and Munos, Rémi. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014.

Precup, Doina, Sutton, Richard S., and Dasgupta, Sanjoy. Off-policy temporal-difference learning with function approximation. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001.

Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2005.

Rockafellar, R. Tyrrell. Convex Analysis. Princeton University Press, 1970.

Shen, Shu-Qian, Huang, Ting-Zhu, and Cheng, Guang-Hui. A condition for the nonsymmetric saddle point matrix being diagonalizable and having real and positive eigenvalues. Journal of Computational and Applied Mathematics, 220:8-12, 2008.

Singh, Satinder P. and Sutton, Richard S. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123-158, 1996.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Sutton, Richard S., Maei, Hamid R., and Szepesvári, Csaba. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, 2009a.

Sutton, Richard S., Maei, Hamid Reza, Precup, Doina, Bhatnagar, Shalabh, Silver, David, Szepesvári, Csaba, and Wiewiora, Eric. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009b.

Tsitsiklis, John N. and Van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 1997.

Valcarcel Macua, Sergio, Chen, Jianshu, Zazo, Santiago, and Sayed, Ali H. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260-1274, 2015.

Wang, Mengdi, Liu, Ji, and Fang, Ethan. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems (NIPS) 29, 2016.

Wasserman, Larry. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2013.


More information

arxiv: v1 [cs.lg] 30 Dec 2016

arxiv: v1 [cs.lg] 30 Dec 2016 Adaptive Least-Squares Temporal Difference Learning Timothy A. Mann and Hugo Penedones and Todd Hester Google DeepMind London, United Kingdom {kingtim, hugopen, toddhester}@google.com Shie Mannor Electrical

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

Generalization and Function Approximation

Generalization and Function Approximation Generalization and Function Approximation 0 Generalization and Function Approximation Suggested reading: Chapter 8 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998.

More information

An Analysis of Laplacian Methods for Value Function Approximation in MDPs

An Analysis of Laplacian Methods for Value Function Approximation in MDPs An Analysis of Laplacian Methods for Value Function Approximation in MDPs Marek Petrik Department of Computer Science University of Massachusetts Amherst, MA 3 petrik@cs.umass.edu Abstract Recently, a

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/ Task Grasp the green cup. Output: Sequence of controller actions Setup from Lenz et. al.

More information

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396

Machine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396 Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

arxiv: v1 [cs.lg] 23 Oct 2017

arxiv: v1 [cs.lg] 23 Oct 2017 Accelerated Reinforcement Learning K. Lakshmanan Department of Computer Science and Engineering Indian Institute of Technology (BHU), Varanasi, India Email: lakshmanank.cse@itbhu.ac.in arxiv:1710.08070v1

More information

arxiv: v1 [math.oc] 5 Feb 2018

arxiv: v1 [math.oc] 5 Feb 2018 for Convex-Concave Saddle Point Problems without Strong Convexity Simon S. Du * 1 Wei Hu * arxiv:180.01504v1 math.oc] 5 Feb 018 Abstract We consider the convex-concave saddle point problem x max y fx +

More information

Low-rank Feature Selection for Reinforcement Learning

Low-rank Feature Selection for Reinforcement Learning Low-rank Feature Selection for Reinforcement Learning Bahram Behzadian and Marek Petrik Department of Computer Science University of New Hampshire {bahram, mpetrik}@cs.unh.edu Abstract Linear value function

More information

A UNIFIED FRAMEWORK FOR LINEAR FUNCTION APPROXIMATION OF VALUE FUNCTIONS IN STOCHASTIC CONTROL

A UNIFIED FRAMEWORK FOR LINEAR FUNCTION APPROXIMATION OF VALUE FUNCTIONS IN STOCHASTIC CONTROL A UNIFIED FRAMEWORK FOR LINEAR FUNCTION APPROXIMATION OF VALUE FUNCTIONS IN STOCHASTIC CONTROL Matilde Sánchez-Fernández Sergio Valcárcel, Santiago Zazo ABSTRACT This paper contributes with a unified formulation

More information

Lecture 23: Reinforcement Learning

Lecture 23: Reinforcement Learning Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:

More information

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning

Active Policy Iteration: Efficient Exploration through Active Learning for Value Function Approximation in Reinforcement Learning Active Policy Iteration: fficient xploration through Active Learning for Value Function Approximation in Reinforcement Learning Takayuki Akiyama, Hirotaka Hachiya, and Masashi Sugiyama Department of Computer

More information

CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum

CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION. From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION From Exploratory Factor Analysis Ledyard R Tucker and Robert C. MacCallum 1997 19 CHAPTER 3 THE COMMON FACTOR MODEL IN THE POPULATION 3.0. Introduction

More information

The Fixed Points of Off-Policy TD

The Fixed Points of Off-Policy TD The Fixed Points of Off-Policy TD J. Zico Kolter Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, MA 2139 kolter@csail.mit.edu Abstract Off-policy

More information

Basis Construction from Power Series Expansions of Value Functions

Basis Construction from Power Series Expansions of Value Functions Basis Construction from Power Series Expansions of Value Functions Sridhar Mahadevan Department of Computer Science University of Massachusetts Amherst, MA 3 mahadeva@cs.umass.edu Bo Liu Department of

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

y(x) = x w + ε(x), (1)

y(x) = x w + ε(x), (1) Linear regression We are ready to consider our first machine-learning problem: linear regression. Suppose that e are interested in the values of a function y(x): R d R, here x is a d-dimensional vector-valued

More information

Replacing eligibility trace for action-value learning with function approximation

Replacing eligibility trace for action-value learning with function approximation Replacing eligibility trace for action-value learning with function approximation Kary FRÄMLING Helsinki University of Technology PL 5500, FI-02015 TKK - Finland Abstract. The eligibility trace is one

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

Minimax strategy for prediction with expert advice under stochastic assumptions

Minimax strategy for prediction with expert advice under stochastic assumptions Minimax strategy for prediction ith expert advice under stochastic assumptions Wojciech Kotłosi Poznań University of Technology, Poland otlosi@cs.put.poznan.pl Abstract We consider the setting of prediction

More information

In: Proc. BENELEARN-98, 8th Belgian-Dutch Conference on Machine Learning, pp 9-46, 998 Linear Quadratic Regulation using Reinforcement Learning Stephan ten Hagen? and Ben Krose Department of Mathematics,

More information

Bayesian Active Learning With Basis Functions

Bayesian Active Learning With Basis Functions Bayesian Active Learning With Basis Functions Ilya O. Ryzhov Warren B. Powell Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA IEEE ADPRL April 13, 2011 1 / 29

More information

Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning

Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning Adaptive Importance Sampling for Value Function Approximation in Off-policy Reinforcement Learning Abstract Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a

More information

Model-Based Reinforcement Learning with Continuous States and Actions

Model-Based Reinforcement Learning with Continuous States and Actions Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions in Proceedings of the 16th European Symposium on Artificial Neural Networks

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

Unifying Task Specification in Reinforcement Learning

Unifying Task Specification in Reinforcement Learning Martha White Abstract Reinforcement learning tasks are typically specified as Markov decision processes. This formalism has been highly successful, though specifications often couple the dynamics of the

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Convergence of Synchronous Reinforcement Learning. with linear function approximation

Convergence of Synchronous Reinforcement Learning. with linear function approximation Convergence of Synchronous Reinforcement Learning with Linear Function Approximation Artur Merke artur.merke@udo.edu Lehrstuhl Informatik, University of Dortmund, 44227 Dortmund, Germany Ralf Schoknecht

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

On the approximation of real powers of sparse, infinite, bounded and Hermitian matrices

On the approximation of real powers of sparse, infinite, bounded and Hermitian matrices On the approximation of real poers of sparse, infinite, bounded and Hermitian matrices Roman Werpachoski Center for Theoretical Physics, Al. Lotnikó 32/46 02-668 Warszaa, Poland Abstract We describe a

More information

Reinforcement Learning Algorithms in Markov Decision Processes AAAI-10 Tutorial. Part II: Learning to predict values

Reinforcement Learning Algorithms in Markov Decision Processes AAAI-10 Tutorial. Part II: Learning to predict values Ideas and Motivation Background Off-policy learning Option formalism Learning about one policy while behaving according to another Needed for RL w/exploration (as in Q-learning) Needed for learning abstract

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Subgradient Methods for Maximum Margin Structured Learning

Subgradient Methods for Maximum Margin Structured Learning Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Function approximation Mario Martin CS-UPC May 18, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 18, 2018 / 65 Recap Algorithms: MonteCarlo methods for Policy Evaluation

More information

Multi-Attribute Bayesian Optimization under Utility Uncertainty

Multi-Attribute Bayesian Optimization under Utility Uncertainty Multi-Attribute Bayesian Optimization under Utility Uncertainty Raul Astudillo Cornell University Ithaca, NY 14853 ra598@cornell.edu Peter I. Frazier Cornell University Ithaca, NY 14853 pf98@cornell.edu

More information

Accelerating Stochastic Composition Optimization

Accelerating Stochastic Composition Optimization Journal of Machine Learning Research 18 (2017) 1-23 Submitted 10/16; Revised 6/17; Published 10/17 Accelerating Stochastic Composition Optimization Mengdi Wang Department of Operations Research and Financial

More information